The use of artificial intelligence and machine learning models can lead to high costs, especially when the underlying application scales to many users. These models are computationally intensive, and their responses are often expected to be delivered quickly. With the right settings, however, significant cost savings can be achieved, and applications can be scaled cost-efficiently.
Our mission and passion lie in deployment solutions with a strong focus on resource efficiency. To support this goal, we offer tools that empower companies to significantly reduce the costs associated with using artificial intelligence. In the following sections, we outline how we make this possible.
Multi-Model Deployment
A common deployment strategy is to dedicate a separate server to each required model. However, this approach often leaves computing resources underutilized or used inefficiently, resulting in high server costs. There are two main reasons why resources are wasted when only a single model runs per server:
Firstly, many models are too small to make use of a server's capacity on their own. For example, computer vision models usually range from 100 to 500 megabytes in size, while GPU servers typically start at 16 GB of memory. This disparity often leads to oversized servers, even when handling large batch sizes or multiple replicas of the same model. Models are placed on a GPU to deliver fast response times, but in many cases the number of requests for a single model is not enough to fully utilize an entire GPU server. The result is underutilized computing resources that still incur costs.
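To make the mismatch concrete, here is a back-of-the-envelope sketch in Python. All numbers (GPU memory, model size, activation overhead, replica count) are illustrative assumptions, not measurements from a specific deployment:

```python
# Back-of-the-envelope utilization of a 16 GB GPU by a single deployed model.
# All numbers are illustrative assumptions, not measurements.
GPU_MEMORY_GB = 16.0        # typical entry-level inference GPU
MODEL_SIZE_GB = 0.3         # e.g. a 300 MB computer vision model
ACTIVATION_OVERHEAD = 3.0   # rough multiplier for activations, buffers, batching
REPLICAS = 2                # parallel copies of the same model

used_gb = MODEL_SIZE_GB * ACTIVATION_OVERHEAD * REPLICAS
utilization = used_gb / GPU_MEMORY_GB
print(f"~{used_gb:.1f} GB of {GPU_MEMORY_GB:.0f} GB in use ({utilization:.0%})")
# -> ~1.8 GB of 16 GB in use (11%); most of the GPU sits idle but is still paid for.
```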
Secondly, modern applications often use several machine learning models, which are executed in a complex sequence and/or combined with each other. This is the case, for example, with model ensembles and model pipelines. If the models required in such a setting are deployed on different servers, communication overhead arises, which increases both resource consumption and inference latency. It is most efficient to keep models that communicate with each other on the same server. Ideally, they should even share the same GPU, allowing intermediate results to remain directly in memory. Copying data to or from GPU memory is often the bottleneck in modern AI-based applications and should be avoided where possible.
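As an illustration, the following sketch shows a two-stage pipeline in PyTorch in which the intermediate result of the first model is passed directly to the second model without leaving GPU memory. The two modules are placeholders, not models from a real application:

```python
# Sketch of a two-stage model pipeline that keeps intermediate results on the GPU.
# `detector` and `classifier` are placeholder modules; the point is that the tensor
# produced by stage 1 is consumed by stage 2 without a copy back to host memory.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

detector = torch.nn.Linear(512, 128).to(device).eval()    # placeholder for stage 1
classifier = torch.nn.Linear(128, 10).to(device).eval()   # placeholder for stage 2

@torch.no_grad()
def run_pipeline(batch: torch.Tensor) -> torch.Tensor:
    batch = batch.to(device)        # one copy onto the GPU
    features = detector(batch)      # intermediate result stays in GPU memory
    logits = classifier(features)   # second stage reads it directly
    return logits.cpu()             # one copy back to the host at the end

outputs = run_pipeline(torch.randn(32, 512))
```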
Deploying multiple models on a single server comes with its own challenges. For example, the server capacity must be chosen so that latency requirements can still be met, and resource management (i.e., the allocation of storage space and computing capacity) is a complex issue in itself. With our toolbox, we offer a solution for every step of the deployment pipeline that can be used without expert knowledge. All you have to do is enter the model parameters and any ensemble/pipeline interactions into our web app. Based on this, we suggest which hardware best suits your requirements, provide a ready-to-use container for on-premise use, or host the models in an optimized way on our servers for you.
Optimizing Models for Inference
Machine learning models are often handed over by the development team and deployed to the servers as-is. However, this leaves a great deal of optimization potential untapped: optimizing a model for inference can make its inference time up to 36 times faster. Many model optimization techniques do not change the output values of the model and therefore cause no loss of performance.
Below are a few examples of possible model optimization techniques:
- Layer fusion: Combining multiple computational instructions into a single operation to reduce memory usage, improve runtime efficiency, and minimize latency during model inference (see the sketch after this list).
- Execution graph optimization: Models are often defined as building blocks that are executed in parallel or one after another (for example, the layers of a neural network). The execution of these building blocks can often be reorganized for faster and more efficient computation.
- Hardware-specific optimizations: Some computations can be made much more efficient by using hardware accelerators and instruction sets correctly. This needs to be done case by case after analyzing the available hardware.
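As a concrete illustration of layer fusion, the following sketch folds a batch-normalization layer into the preceding convolution, a standard transformation that leaves the outputs (numerically) unchanged. It is a minimal PyTorch example, not our production implementation:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into the preceding Conv2d (standard folding)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)   # per-channel scale
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

# One fused convolution replaces two separate operations with identical outputs.
conv, bn = nn.Conv2d(3, 8, 3, padding=1).eval(), nn.BatchNorm2d(8).eval()
x = torch.randn(1, 3, 32, 32)
fused = fuse_conv_bn(conv, bn)
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```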
It is a challenge to keep an overview of the available optimization methods, and it is often impossible to know in advance which of them will improve a given model. In our toolbox, your models are automatically analyzed and optimized for inference and deployment. We use techniques that do not change the output of your models but can make them significantly faster and less resource-intensive.
Setting Deployment Parameters Right
There is also great potential for optimization in the way models are deployed. We differentiate between deployment parameters and deployment tools. Deployment parameters determine, for example, what the maximum batch size should be for each model in order to use the server as efficiently as possible. Selecting the right deployment tools, on the other hand, relates primarily to the infrastructure on the server, such as the API that forwards requests to the models.
The most important deployment parameters are the maximum batch size and the number of concurrent executions of each model. Both directly influence the model's memory requirements as well as its inference time. Under-dimensioning can result in long waiting times before a request to the model can be executed; over-dimensioning, on the other hand, wastes valuable server resources through idling capacity. Selecting the right deployment parameters is challenging, especially when multiple models are deployed to the same server, because the parameters must be chosen based on the available resources and the expected number of requests for each model.
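The following toy sketch illustrates the memory side of this trade-off. The per-model numbers are hypothetical placeholders; in practice they would come from profiling the model on the target hardware:

```python
# Toy sizing sketch: how maximum batch size and the number of concurrent model
# instances affect GPU memory. All numbers are illustrative assumptions.
GPU_MEMORY_GB = 16.0
WEIGHTS_GB = 0.4       # one copy of the model weights
PER_ITEM_GB = 0.05     # activation memory per item in a batch

def footprint_gb(max_batch_size: int, instances: int) -> float:
    """Worst-case memory if every instance processes a full batch at once."""
    return instances * (WEIGHTS_GB + max_batch_size * PER_ITEM_GB)

for max_batch, instances in [(4, 1), (16, 2), (32, 4), (64, 8)]:
    gb = footprint_gb(max_batch, instances)
    fits = "fits" if gb <= GPU_MEMORY_GB else "does NOT fit"
    print(f"batch={max_batch:2d}, instances={instances}: {gb:4.1f} GB -> {fits}")
```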
A whole stack of tools is required to deploy models to a server. These include, for example, the API that receives requests for the models and the frameworks that execute the models to answer them. We go into the required components in more detail on this page. In our toolbox, we offer an end-to-end solution with a heavily optimized deployment stack. We have focused on resource efficiency and speed, which can lead to significantly faster inference times compared to other solutions.
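For illustration, here is a minimal sketch of the API layer of such a stack: an HTTP endpoint that receives a request and forwards it to a loaded model. FastAPI and the dummy model are illustrative choices, not the optimized stack described above:

```python
# Minimal sketch of the API layer in a deployment stack: an HTTP endpoint that
# receives requests and forwards them to a loaded model.
from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()
model = torch.nn.Linear(4, 2).eval()    # placeholder for a real, optimized model

class PredictRequest(BaseModel):
    features: list[float]               # this dummy model expects 4 values

@app.post("/predict")
def predict(request: PredictRequest) -> dict:
    with torch.no_grad():
        logits = model(torch.tensor(request.features))
    return {"logits": logits.tolist()}

# Run locally with, for example: uvicorn serve:app --port 8000
```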
We see deployment as a multidimensional optimization problem. With our toolbox, we offer tools and methods to help you find a good solution to this problem. You can use these without any expert knowledge, as we have integrated the optimizations as automated components into our pipeline.
Get in Touch!
Partner with us to optimize your models and unlock their full potential. Share your requirements, and we’ll show you how we can drive your success.
Need more information? Reach out today to discuss how we can help you achieve your goals.