{"id":2354,"date":"2024-07-22T11:47:16","date_gmt":"2024-07-22T09:47:16","guid":{"rendered":"https:\/\/dev-techl.eu\/?p=2354"},"modified":"2025-04-01T15:27:13","modified_gmt":"2025-04-01T13:27:13","slug":"platforms-and-infrastructure-to-operate-genai-in-your-companys-basement","status":"publish","type":"post","link":"https:\/\/www.united-innovations.eu\/de\/2024\/07\/22\/platforms-and-infrastructure-to-operate-genai-in-your-companys-basement\/","title":{"rendered":"Platforms and Infrastructure to Operate GenAI in Your Company&#8217;s  Basement"},"content":{"rendered":"<div data-elementor-type=\"wp-post\" data-elementor-id=\"2354\" class=\"elementor elementor-2354\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-918d1ed elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"918d1ed\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f0a37cd\" data-id=\"f0a37cd\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b689ed2 elementor-widget elementor-widget-text-editor\" data-id=\"b689ed2\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p><\/p><p>Hosting, operating and monitoring generative AI (GenAI) solutions is challenging, in particular if cloud resources provided by OpenAI or Azure cannot deliver in terms of privacy and cost efficiency. 
How can companies build and operate platforms for hosting foundation models as part of a GenAI solution on their own?<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p><strong>An article by Dennis Wegener &amp; Benny Stein, Fraunhofer Institute for Intelligent Analysis and Information Systems<span class=\"Apple-converted-space\">\u00a0<\/span><\/strong><\/p><p><span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p><strong>In recent years, generative AI has revolutionized various industries by enabling the creation of highly sophisticated and creative outputs. However, the journey to harnessing the full potential of generative AI has just begun, especially for organizations opting to self-host these solutions. Unlike setups that rely on cloud resources from OpenAI, Microsoft Azure or their competitors, self-hosting requires extensive computational power, substantial data storage, and robust infrastructure \u2014 but it also offers tempting benefits like data privacy and cost efficiency. In this article we discuss self-hosting of generative AI and report on the technical and operational hurdles involved. Additionally, we provide detailed information on how we have built our own platform for multimodal foundation models, offering insights into the necessary steps and considerations for a successful implementation.<span class=\"Apple-converted-space\">\u00a0<\/span><\/strong><\/p><p>Generative AI and foundation models have attracted significant attention since the release of ChatGPT in 2022. Today, interest in these models has expanded to numerous fields and business units, highlighting the substantial demand for AI solutions based on foundation models. Many instances of generative AI are available: closed-source foundation models and GenAI services are generally provided via commercial APIs or public cloud platforms, whereas open-source alternatives are distributed as model artifacts (\u201ccheckpoints\u201d) on platforms like Hugging Face. 
In both scenarios, getting a grip on high-performance and cost-efficient generative AI services is challenging. So far, few production-ready on-premises solutions exist. Additionally, the alarming increase in concerns about implementation costs reported in [1] shows the necessity of alternatives to public cloud services, especially as their costs usually scale linearly with usage.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p>Numerous platforms are available for accessing, demonstrating and comparing large language and more general foundation models. These include:<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><ol><li><strong>OpenAI:<\/strong> The most prominent platform, offering ChatGPT, various versions of GPTs, DALL\u00b7E, and Sora for text, image, and video generation as a service based on (closed-source) models.<br \/><br \/><\/li><li><span class=\"Apple-converted-space\"><strong>Amazon Bedrock playgrounds:<\/strong> A platform for testing inference on different models before they can be deployed in an application (non-public). 
Additionally, PartyRock provides a code-free playground for building AI applications based on Bedrock.\u00a0<br \/><br \/><\/span><\/li><li><strong>NVIDIA AI Playground:<\/strong> This platform allows users to test models from a growing catalog via model-specific demo user interfaces (UI).<span class=\"Apple-converted-space\">\u00a0<\/span><\/li><li><p><strong>Databricks AI Playground:<\/strong> A playground to test, prompt, and compare different large language models (non-public).<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/li><li><p><strong>Vercel AI model comparison:<\/strong> Focused on an SDK for comparing different models, this platform also aims to simplify the development of JavaScript\/TypeScript interfaces.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/li><li><strong>Hugging Face<\/strong> offers the largest collection of open-source models, including an inference API and a UI for testing individual models.<\/li><\/ol><p>In addition to platforms where models are hosted, there are platforms that serve as gateways to other providers. These gateways aim to simplify the comparison and replacement of LLMs by offering a more unified interface:<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><ol><li><p><strong>Kong AI Gateway:<\/strong> Currently supports the providers OpenAI, Cohere, Azure, Anthropic, Mistral and some self-hosted models.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/li><li><p><strong>MLflow Deployments Server:<\/strong> Can be set up locally in minutes, with providers specified by a simple configuration file.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p>In the following, we outline how to build a self-contained, on-premises infrastructure for inference based on multimodal foundation models that operate on text, images, audio, embeddings, and their combinations. 
It is designed to comply with data privacy, access management, IT security, trustworthiness, and \u2014 most importantly \u2014 usability for a wide range of downstream research and business scenarios. Our own instance of this setup is used by AI researchers and engineers to rapidly develop proofs of concept for GenAI-centric applications. Moreover, it is regularly used in workshops for companies beginning to adopt GenAI for their businesses.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p><strong>Use Cases, Models and Features<span class=\"Apple-converted-space\">\u00a0<\/span><\/strong><\/p><p>The platform addresses all common conversation scenarios: text in\u2014text out (for text generation and chatbots), text in\u2014audio out (for speech synthesis), text in\u2014embedding out (for retrieval systems), text in\u2014image out (for image or more general content creation), and audio in\u2014text out (for transcriptions, speech recognition and audio chatbots). Each of these scenarios can be supported by different capable open-source models (with permissive licenses).<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p>The following models have been tested on various tasks and are currently accessible in our instance:<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><table cellspacing=\"0\" cellpadding=\"0\"><tbody><tr><td valign=\"top\"><p>Model<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>Input<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>Output<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>Meta: Llama 3 8b &amp; 70b chat<\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>MistralAI: Mistral-7B-Instruct-v0.3<span 
class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>MistralAI: Mixtral-8x7B-Instruct-v0.1<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>StabilityAI: Stable Diffusion SD-XL 1.0<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>image<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>OpenAI: Whisper-large-v2<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>audio<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>primeLine: Whisper-large-v3-german<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>audio<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>NVIDIA: FastPitch (en-US)<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>audio<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>Meta: MMS text-to-speech (DE)<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td 
valign=\"top\"><p>audio<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><tr><td valign=\"top\"><p>SentenceTransformers: all-mpnet-base-v2 (embedding model)<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>text<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><td valign=\"top\"><p>vector<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/td><\/tr><\/tbody><\/table><\/li><\/ol><p>Each of these I\/O combinations requires a differ-ent type of user interface, which is why we have a separate tab for each modality in the UI shown in Fig. 1. In addition, the functionality of the mod-els is accessible through a dedicated API, which allows for larger workloads and in general more traffic on the system. After all, the applications we build on top of the models don\u2019t use the UI.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p><strong>About the Technical Architecture<span class=\"Apple-converted-space\">\u00a0<\/span><\/strong><\/p><p>The architecture contains the following elements:<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><ol><li><em>Frontend:<\/em> The user interface shown in Fig. 1 is based on Gradio. It provides different tabs for Text Generation, Automatic Speech Recognition, Speech Synthesis and Image Generation. In each tab, the user can select a model from the list of available models and can interact with it. The frontend com-municates with an Identity Access Manage-ment (IAM) system for authentication and with the API server for model access and inference.<br \/><br \/><\/li><li><em>Backend:<\/em> We use a node concept for the model backends. Each node represents a model serving component which serves a single or multiple models. We use NVIDIA Triton Inference Server, vLLM and Hugging Face\u2019s Text Generation Inference (TGI) as serving components. 
The nodes provide standard interfaces and are KServe- or OpenAI-compatible.<br \/><br \/><\/li><li><em>API:<\/em> The API server is written in Go and offers clients a standardized interface to the various node protocols of the backends by adapting the inference requests. It supports all conversation scenarios described above. Moreover, responses can be requested synchronously, asynchronously (via sequential queueing) or as a stream. Asynchronous inference results are cached until retrieved by the user.<br \/><br \/><\/li><li><em>Moderation:<\/em> Requests and responses can optionally be sent to a moderation service that uses classifiers for toxicity, prompt injection and personally identifiable information to detect content that should be filtered.<br \/><br \/><\/li><li><p><em>Databases:<\/em> We use the following two database instances: Redis for caching results per user and storing global and node configurations, and a PostgreSQL database for storing all user-related information. The latter is only accessed by the IAM system.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><\/li><\/ol><p><em>Monitoring:<\/em> We provide health and metrics endpoints for the frontend and the API server and use the health and metrics endpoints that the model backends provide. The metrics endpoints are consumed by a Prometheus instance and visualized in Grafana dashboards to gain insights about traffic, cost, energy consumption, GPU (Graphics Processing Unit) load and usage statistics for models and users.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p>Lastly, the architecture is extensible and checks common security boxes like TLS communication between all servers, RBAC for the platform and a dedicated IAM system that takes care of authentication and authorization. 
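Concretely, since the API server exposes an OpenAI-compatible interface, a client authenticated via the IAM system can call it with plain HTTP. The sketch below uses only the Python standard library; the host name, path and model identifier are illustrative assumptions, not the platform's actual values:

```python
import json
import urllib.request

# Hypothetical endpoint of the self-hosted API server (adjust to your deployment).
API_URL = 'https://genai.example.internal/v1/chat/completions'


def build_chat_request(model, prompt, stream=False):
    """Build an OpenAI-style chat completion payload, the request shape
    that the OpenAI-compatible nodes behind the API server accept."""
    return {
        'model': model,
        'messages': [{'role': 'user', 'content': prompt}],
        'stream': stream,
    }


def post_chat(payload, token):
    """Send the request with a bearer token issued by the IAM system."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Content-Type': 'application/json',
            'Authorization': 'Bearer ' + token,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Build a request for one of the hosted chat models (name is illustrative).
payload = build_chat_request('llama-3-8b-chat', 'Summarize our GPU options.')
```

Because the interface is standardized across nodes, swapping vLLM for TGI or Triton behind a model name requires no client-side changes.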
This enables usage in production settings.<span class=\"Apple-converted-space\">\u00a0<br \/><\/span><\/p><p><strong>Technical Requirements<span class=\"Apple-converted-space\">\u00a0<\/span><\/strong><\/p><p>The technical requirements to run such an infrastructure (with and without tweaks) are as follows:<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p>Foundation models require GPUs for performance reasons. The model size determines the required GPU VRAM. A rough but convenient estimate is the following: twice the number of model parameters (in billions) approximately gives the required VRAM (in GB), since each parameter occupies two bytes in 16-bit precision; so 2 x 7 = 14 GB VRAM for a 7B parameter model. For up to four 7 billion parameter models, a single NVIDIA A100 or H100 GPU with 80 GB VRAM is sufficient (with some necessary overhead). With the use of quantization techniques \u2014 a common tweak to reduce the effective model size \u2014 even larger models fit on such a GPU. As an example, a 4-bit quantized version of the powerful 8x22B Mixtral model developed by Mistral AI fits on such a device. Of course, quantization can also be a cost-effective option for hosting models on much smaller GPUs. After all, the price tag on NVIDIA A100 and H100 GPUs is impressive, and cheaper (but less performant) GPUs like the V100 or some of the NVIDIA RTX 3000\/4000 series do the job in smaller settings, too. Other requirements are quite modest: the machines only need standard CPUs and a common network, plus containerization software such as Docker. So, the biggest hurdle is the cost factor of the GPUs.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p><strong>Wrap-up<\/strong><span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p>We showed how to set up an infrastructure for foundation models that allows for cutting-edge demonstrations of their capabilities. This infrastructure can run in on-premises production environments. 
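As a back-of-the-envelope companion to the sizing rule from the Technical Requirements section, the estimate can be written down in a few lines (the 20% overhead factor for KV cache and runtime buffers is an assumed ballpark, not a measured value):

```python
def estimate_vram_gb(params_billion, bits_per_weight=16, overhead=1.2):
    """Rough VRAM estimate for serving a model.

    params_billion  -- model size in billions of parameters
    bits_per_weight -- 16 for fp16/bf16 weights, 4 for 4-bit quantization
    overhead        -- multiplier for KV cache, activations and buffers
    """
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead


# 7B model in 16-bit precision: the "2 x 7 = 14 GB" rule plus overhead.
print(round(estimate_vram_gb(7), 1))  # 16.8
# Mixtral 8x22B (~141B parameters) quantized to 4 bits: the weights alone
# need about 70.5 GB, which is why the model can fit on an 80 GB A100/H100.
print(round(estimate_vram_gb(141, bits_per_weight=4, overhead=1.0), 1))  # 70.5
```

The same function shows why quantization pays off on small GPUs: dropping from 16 to 4 bits per weight cuts the estimate by a factor of four.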
It allows for UI- and API-based access and provides access management and robust monitoring.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p>Our solution is already used in many customer and research projects, with increasing demand. For more information, just get in touch with us \u2013 or have a look at our various activities and offers around Generative AI [2].<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p><span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p><img fetchpriority=\"high\" decoding=\"async\" class=\"wp-image-2350 size-medium alignleft\" style=\"width: 146px; height: auto;\" src=\"https:\/\/dev-techl.eu\/wp-content\/uploads\/2024\/07\/Dr.-Dennis-Wegener-300x300.jpg\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/Dr.-Dennis-Wegener-300x300.jpg 300w, https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/Dr.-Dennis-Wegener-150x150.jpg 150w, https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/Dr.-Dennis-Wegener-230x229.jpg 230w, https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/Dr.-Dennis-Wegener.jpg 335w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p><p><\/p><p><strong>Dr. 
Dennis Wegener<\/strong><br \/>Teamlead MLOps<br \/>Fraunhofer IAIS<\/p><p>\u00a0<\/p><p>\u00a0<\/p><p><img decoding=\"async\" class=\"wp-image-2349 size-medium alignleft\" style=\"width: 146px; height: auto;\" src=\"https:\/\/dev-techl.eu\/wp-content\/uploads\/2024\/07\/Dr.-Benny-Stein-300x300.jpg\" alt=\"\" width=\"300\" height=\"300\" srcset=\"https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/Dr.-Benny-Stein-300x300.jpg 300w, https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/Dr.-Benny-Stein-150x150.jpg 150w, https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/Dr.-Benny-Stein-230x230.jpg 230w, https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/Dr.-Benny-Stein.jpg 332w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p><p><\/p><p><strong>Dr. Benny Stein<\/strong><span class=\"Apple-converted-space\"><strong>\u00a0<\/strong><br \/><\/span>MLOps Engineer<span class=\"Apple-converted-space\">\u00a0<br \/><\/span>Fraunhofer IAIS<span class=\"Apple-converted-space\">\u00a0<\/span><\/p><p><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>","protected":false},"excerpt":{"rendered":"<p>Hosting, operating and monitoring generative AI (GenAI) solutions is challenging, in particular if cloud resources provided by OpenAI or Azure cannot deliver 
[&hellip;]<\/p>","protected":false},"author":1524,"featured_media":2451,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1,36,35],"tags":[],"class_list":["post-2354","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-general","category-it-security","category-software"],"featured_media_src_url":"https:\/\/www.united-innovations.eu\/wp-content\/uploads\/2024\/07\/pexels-tara-winstead-8386440-scaled.jpg","_links":{"self":[{"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/posts\/2354","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/users\/1524"}],"replies":[{"embeddable":true,"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/comments?post=2354"}],"version-history":[{"count":10,"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/posts\/2354\/revisions"}],"predecessor-version":[{"id":2365,"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/posts\/2354\/revisions\/2365"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/media\/2451"}],"wp:attachment":[{"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/media?parent=2354"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/categories?post=2354"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.united-innovations.eu\/de\/wp-json\/wp\/v2\/tags?post=2354"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}