Partitioning an LLM between cloud and edge

Historically, large language models (LLMs) have required substantial computational resources, which has confined development and deployment mainly to powerful centralized systems, such as public cloud providers. However, although many people believe that we need massive numbers of GPUs bound to vast amounts of storage to run generative AI, in truth, there are ways to use a tiered or partitioned architecture to drive value for specific business use cases.

Somehow, it’s in the generative AI zeitgeist that edge computing won’t work, given the processing requirements of generative AI models and the need to drive high-performing inferences. I’m often challenged when I suggest a “knowledge at the edge” architecture due to this misperception. We’re missing a huge opportunity to be innovative, so let’s take a look.

It’s always been possible

A hybrid approach that splits LLM workloads between edge and cloud maximizes the efficiency of both infrastructure types. Running certain operations on the edge significantly lowers latency, which is crucial for applications requiring immediate feedback, such as interactive AI services and real-time data processing. Tasks that do not require real-time responses can be relegated to cloud servers.

Partitioning these models offers a way to balance the computational load, enhance responsiveness, and increase the efficiency of AI deployments. The technique involves running different parts or versions of LLMs on edge devices, centralized cloud servers, or on-premises servers.

By partitioning LLMs, we achieve a scalable architecture in which edge devices handle lightweight, real-time tasks while the heavy lifting is offloaded to the cloud. For example, say we are running medical scanning devices that exist worldwide. AI-driven image processing and analysis is core to the value of those devices; however, if we’re shipping huge images back to some central computing platform for diagnostics, that won’t be optimal. Network latency will delay some of the processing, and if the network is somehow out, which it may be in several rural areas, then you’re out of business.

About 80% of diagnostic tests can run fine on a lower-powered device set next to the scanner. Thus, routine things that the scanner is designed to detect could be handled locally, while tests that require more extensive or more complex processing could be pushed to the centralized server for additional diagnostics.
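To make the scanner example concrete, here is a minimal sketch of the routing decision, written in Python. Everything in it is an assumption for illustration: the `ROUTINE_TESTS` names, the `Scan` type, and the `run_on_edge`, `run_in_cloud`, and `cloud_available` callables are hypothetical stand-ins, not part of any real product or library.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical test types that the small edge model is trusted to handle.
ROUTINE_TESTS = {"fracture_screen", "density_check"}


@dataclass
class Scan:
    test_type: str
    image_bytes: bytes


def route_scan(
    scan: Scan,
    run_on_edge: Callable[[Scan], str],
    run_in_cloud: Callable[[Scan], str],
    cloud_available: Callable[[], bool],
) -> Tuple[str, Optional[str]]:
    """Return (tier, result) for one scan.

    Routine tests stay on the device next to the scanner. Anything else
    goes to the centralized system, or is queued if the network is down.
    """
    if scan.test_type in ROUTINE_TESTS:
        return "edge", run_on_edge(scan)
    if cloud_available():
        return "cloud", run_in_cloud(scan)
    # No connectivity: keep the scan for later instead of failing outright.
    return "queued", None
```

The point of the sketch is that the routine path never touches the network, so a rural outage only delays the complex cases.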

Other use cases include the diagnostics of components of a jet in flight. You would love to have the power of AI to monitor and correct issues with jet engine operations, and you would need those issues to be corrected in near real time. Pushing the operational diagnostics back to some centralized AI processing system would not only be non-optimal but unsafe.

Why is hybrid AI architecture not widespread?

A partitioned architecture reduces latency and conserves energy and computational power. Sensitive data can be processed locally on edge devices, alleviating privacy concerns by minimizing data transmission over the Internet. In our medical device example, this means that personally identifiable information concerns are reduced, and the security of that data is a bit more straightforward. The cloud can then handle generalized, non-sensitive aspects, ensuring a layered security approach.

So, why isn’t everyone using it?

First, it’s complex. This architecture takes thinking and planning. Generative AI is new, and most AI architects are new, and they get their architecture cues from cloud providers that push the cloud. This is why it’s not a good idea to allow architects who work for a specific cloud provider to design your AI system. You’ll get a cloud solution each time. Cloud providers, I’m looking at you.

Second, generative AI ecosystems need better support for this pattern. Today, they offer better support for centralized, cloud-based, on-premises, or open-source AI systems. For a hybrid architecture pattern, you must DIY, although there are a few valuable solutions on the market, including edge computing tool sets that support AI.

How to build a hybrid architecture

The first step involves evaluating the LLM and the AI toolkits and determining which components can be effectively run on the edge. This typically includes lightweight models or specific layers of a larger model that perform inference tasks.

Complex training and fine-tuning operations remain in the cloud or other externalized systems. Edge systems can preprocess raw data to reduce its volume and complexity before sending it to the cloud or processing it with the edge system’s own LLM (or a small language model). The preprocessing stage includes data cleaning, anonymization, and preliminary feature extraction, streamlining the subsequent centralized processing.
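As a rough illustration of that preprocessing stage, the Python sketch below cleans a record, strips identifying fields, and reduces raw readings to a couple of summary features before anything leaves the device. The field names (`patient_name`, `patient_id`, `readings`, and so on) and the salted-hash scheme are assumptions made up for this example; a salted hash is pseudonymization rather than full anonymization, and real privacy requirements would need a proper review.

```python
import hashlib

# Hypothetical direct identifiers that should never leave the edge device.
PII_FIELDS = {"patient_name", "date_of_birth", "address"}


def preprocess_record(record: dict, salt: str) -> dict:
    """Clean, pseudonymize, and summarize one record on the edge."""
    # Data cleaning: drop fields that are missing or empty.
    cleaned = {k: v for k, v in record.items() if v not in (None, "")}

    # Anonymization step: remove direct identifiers entirely.
    payload = {k: v for k, v in cleaned.items() if k not in PII_FIELDS}

    # Replace the patient ID with a salted hash so records can still be
    # linked centrally without exposing the original ID.
    if "patient_id" in payload:
        raw = (salt + str(payload["patient_id"])).encode("utf-8")
        payload["patient_id"] = hashlib.sha256(raw).hexdigest()[:16]

    # Preliminary feature extraction: send summaries, not raw readings.
    readings = payload.pop("readings", [])
    if readings:
        payload["reading_mean"] = sum(readings) / len(readings)
        payload["reading_max"] = max(readings)
    return payload
```

The payload that reaches the centralized LLM is smaller than the raw record and carries no direct identifiers, which is the reduction in volume and sensitivity described above.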

Thus, the edge system can play two roles: It is a preprocessor for data and API calls that will be passed to the centralized LLM, or it performs some processing/inference that can be best handled using the smaller model on the edge device. This should provide optimal efficiency since both tiers are working together, and we’re also doing the most with the fewest resources by using this hybrid edge/center model.

For the partitioned model to function cohesively, edge and cloud systems must synchronize efficiently. This requires robust APIs and data-transfer protocols to ensure smooth system communication. Continuous synchronization also allows for real-time updates and model improvements.
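One small piece of that synchronization is deciding when an edge device should pull a newer model from the center. The Python sketch below assumes a hypothetical `ModelManifest` record (name, version tuple, and checksum) that both tiers publish; the structure and the comparison rule are illustrative assumptions, not a standard protocol.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelManifest:
    name: str
    version: Tuple[int, int, int]  # e.g. (1, 4, 0)
    sha256: str                    # checksum of the model weights


def needs_update(local: ModelManifest, remote: ModelManifest) -> bool:
    """Return True if the edge device should fetch the remote model."""
    if local.name != remote.name:
        raise ValueError("manifests describe different models")
    if remote.version > local.version:
        return True
    # Same version but different weights suggests a corrupted or stale copy.
    return remote.version == local.version and remote.sha256 != local.sha256
```

An edge device could run a check like this on a schedule, or whenever connectivity returns, and keep serving its current model in the meantime.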

Finally, performance assessments are run to fine-tune the partitioned model. This process includes load balancing, latency testing, and resource allocation optimization to ensure the architecture meets application-specific requirements.
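Latency testing can start very simply. The Python sketch below times repeated calls to any inference function and reports median, 95th-percentile, and worst-case latency in milliseconds. The `measure_latency` helper and its default of 50 runs are assumptions for illustration; in practice you would pass in the real edge or cloud call and compare the two.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(call: Callable[[], object], runs: int = 50) -> Dict[str, float]:
    """Time `call` repeatedly and summarize latency in milliseconds."""
    if runs < 2:
        raise ValueError("need at least two runs to compute percentiles")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    # The inclusive method keeps percentiles within the observed range.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": cuts[94],
        "max_ms": max(samples),
    }
```

Running this against both tiers gives numbers to check against the application's latency budget before deciding which tasks stay on the edge.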

Partitioning generative AI LLMs across the edge and central/cloud infrastructures epitomizes the next frontier in AI deployment. This hybrid approach enhances performance and responsiveness and optimizes resource usage and security. However, most enterprises and even technology providers are afraid of this architecture, considering it too complex, too expensive, and too slow to build and deploy.

That’s not the case. Not considering this option means that you’re likely missing good business value. Also, you’re at risk of having people like me show up in a few years and point out that you missed the boat in terms of AI optimization. You’ve been warned.

Copyright © 2024 IDG Communications, Inc.