Whitepaper

Abstract

This whitepaper introduces a novel large language model (LLM) architecture that demonstrates significant improvements in language-model processing tasks, including inference, fine-tuning, and prediction. The proposed architecture chains a series of small language models (SLMs) so that learning is transferred seamlessly between the connected SLMs, enabling them to work in unison. It incorporates several innovative techniques, including a 'main' routing SLM that is trained on the capabilities of the other SLMs and provides a directive and routing service: when a user prompt arrives through the connecting chatbot interface, the router determines which SLM in the network, each trained on a different but complementary domain-specific dataset, should process the prompt and respond accordingly.
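
As an illustration of the routing idea, the sketch below shows one way a 'main' router might dispatch a prompt to a domain-specific SLM. The class names, the keyword-based scoring rule, and the example domains are our own simplified assumptions; in the proposed architecture the router would itself be a small language model trained on the other SLMs' capabilities.

```python
# Minimal sketch of a routing SLM dispatching prompts to domain-specific SLMs.
# The keyword scoring below is a stand-in for a trained routing model.

class DomainSLM:
    def __init__(self, domain, keywords):
        self.domain = domain
        self.keywords = keywords

    def generate(self, prompt: str) -> str:
        # Placeholder for the domain-specific model's actual inference call.
        return f"[{self.domain} SLM] response to: {prompt}"

class RoutingSLM:
    def __init__(self, experts):
        self.experts = experts

    def route(self, prompt: str) -> DomainSLM:
        # Score each expert by keyword overlap; a trained router would
        # instead predict the best expert directly from the prompt.
        scores = {
            expert: sum(k in prompt.lower() for k in expert.keywords)
            for expert in self.experts
        }
        return max(scores, key=scores.get)

    def answer(self, prompt: str) -> str:
        return self.route(prompt).generate(prompt)

router = RoutingSLM([
    DomainSLM("finance", ["loan", "interest", "portfolio"]),
    DomainSLM("medical", ["symptom", "diagnosis", "dosage"]),
])
print(router.answer("What interest rate applies to this loan?"))
```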

Custom Attention Mechanism

Our model is designed to handle multimodal input with a near-limitless context window. It uses our own customized attention mechanism, designed to enhance 'one-shot' learning so that the model learns faster, forms better contextual understanding from less data, and generates highly coherent responses to users.

Enhanced 'one-shot' learning means faster learning from less data, which translates into a significant reduction in training time and cost.
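
Since the custom attention mechanism itself is not detailed here, the sketch below shows only the general idea behind a near-limitless context window: each token attends to a bounded window of recent positions, so per-token cost stays fixed as the context grows. The window size, the use of standard scaled dot-product attention, and all tensor shapes are illustrative assumptions, not our actual mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_attention(q, k, v, window=256):
    """Scaled dot-product attention where each query attends only to the
    last `window` positions, so cost per token stays bounded as the
    context grows (an assumption standing in for the custom mechanism)."""
    seq_len, d = q.shape
    out = np.zeros_like(v)
    for t in range(seq_len):
        start = max(0, t - window + 1)
        scores = q[t] @ k[start:t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ v[start:t + 1]
    return out

# Toy usage: a 1,000-token context with 64-dimensional heads.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((1000, 64)) for _ in range(3))
print(windowed_attention(q, k, v, window=128).shape)  # (1000, 64)
```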

This emulates the concept of federated learning: a prompt is broken into smaller pieces that are processed by individual SLMs, and the learning/memory is passed on to the next SLM in the chain to continue. Because each of these smaller models is far less complex than a gigantic model like ChatGPT with trillions of parameters processing the prompt in one go, the cost of inference is significantly lower.
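
A minimal sketch of this chaining idea is shown below, assuming each SLM consumes the running shared context and appends its own contribution before handing off. The `SLMStage` interface and the list-based 'memory' are placeholders for the actual transfer-learning mechanism between connected SLMs.

```python
# Sketch of chained SLMs passing shared context down the pipeline.
# Each stage stands in for a small model that reads the accumulated
# context and adds its own output; real stages would run model inference.

from dataclasses import dataclass, field

@dataclass
class SharedContext:
    prompt: str
    memory: list = field(default_factory=list)  # contributions from earlier SLMs

class SLMStage:
    def __init__(self, name):
        self.name = name

    def process(self, ctx: SharedContext) -> SharedContext:
        # Placeholder inference: a real SLM would condition on ctx.memory.
        ctx.memory.append(f"{self.name} processed '{ctx.prompt}' "
                          f"with {len(ctx.memory)} prior contribution(s)")
        return ctx

def run_chain(stages, prompt):
    ctx = SharedContext(prompt)
    for stage in stages:        # learning/memory flows to the next SLM
        ctx = stage.process(ctx)
    return ctx.memory[-1]       # final SLM produces the response

print(run_chain([SLMStage("parser"), SLMStage("reasoner"), SLMStage("writer")],
                "Summarise this contract"))
```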

This architecture is further supported by a distributed infrastructure of multiple cloud server instances. With sufficient load balancing, the computational demand is spread across the fleet rather than concentrated on a few thousand high-end GPUs such as the H100, potentially reducing the cost of inference.
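
To illustrate, the sketch below spreads SLM inference requests across a pool of cloud instances with a simple least-loaded policy. The instance names, request costs, and scheduling rule are all assumptions for illustration, not the production load balancer.

```python
import heapq

# Sketch of least-loaded dispatch of SLM requests across cloud instances.

class InstancePool:
    def __init__(self, instances):
        # Heap of (current_load, name) so the least-loaded instance pops first.
        self.heap = [(0, name) for name in instances]
        heapq.heapify(self.heap)

    def dispatch(self, request_cost=1):
        load, name = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + request_cost, name))
        return name  # in practice this would forward the request over the network

pool = InstancePool(["gpu-node-1", "gpu-node-2", "gpu-node-3"])
assignments = [pool.dispatch() for _ in range(7)]
print(assignments)  # requests rotate across the three instances
```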

Potential Cost Efficiency:

Large language models (LLMs) like GPT-4 require significant processing power. While high-end GPUs like the Nvidia A100 are typically used, exploring alternative solutions can be interesting.

Potential Savings with Lower-Capacity Consumer GPUs Used in Greater Numbers:

Let’s consider a hypothetical scenario. If a less powerful GPU like the RTX 3090 could be used effectively for LLM inference, there could be a cost advantage. An A100 might be roughly 10x more powerful, but it is also many times more expensive than an RTX 3090; matching one A100 might require around 8 RTX 3090s to achieve similar performance. This translates to a potential upfront cost reduction (worked through in the short sketch after the list):

  • Estimated A100 cost (per unit): $30,000
  • Estimated RTX 3090 cost (per unit): $1,800
  • Hypothetical A100s needed for GPT-4 (based on previous assumptions): 1,260
  • Hypothetical RTX 3090s needed for similar performance: 1,260 x 8 = 10,080
  • Total cost for A100s: $30,000 x 1,260 = $37,800,000
  • Total cost for RTX 3090s: $1,800 x 10,080 = $18,144,000
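
For clarity, the same hypothetical figures can be reproduced with a few lines of arithmetic; the unit prices and GPU counts below are the estimates from the list above, not measured data.

```python
# Reproduce the hypothetical upfront-cost comparison from the list above.
a100_unit_cost    = 30_000           # estimated A100 price (USD)
rtx3090_unit_cost = 1_800            # estimated RTX 3090 price (USD)
a100_count        = 1_260            # hypothetical A100s needed
rtx3090_count     = a100_count * 8   # ~8 RTX 3090s assumed per A100

a100_total    = a100_unit_cost * a100_count          # 37,800,000
rtx3090_total = rtx3090_unit_cost * rtx3090_count    # 18,144,000
savings       = 1 - rtx3090_total / a100_total

print(f"A100 total:      ${a100_total:,}")
print(f"RTX 3090 total:  ${rtx3090_total:,}")
print(f"Upfront savings: {savings:.0%}")   # ~52%
```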

This simplified calculation suggests a potential cost savings of nearly 52%. However, it’s crucial to consider limitations:

  • Performance Caveats: Consumer GPUs might not be optimized for LLM tasks, impacting efficiency and scaling.
  • Hidden Costs: Power consumption, cooling, and infrastructure needs might be higher with more consumer-grade GPUs.

While the potential cost savings are intriguing, a complete picture requires a more nuanced analysis. Specialized hardware like A100s might still be more efficient in the long run, despite higher upfront costs. Factors like power consumption, cooling, and software optimization for LLMs need careful consideration. However, this remains an interesting option for us, and we plan to mix and match different types of hardware to achieve the optimum results.

Our idea takes these potential savings to another level through incorporating distributed computing. Get in touch with us and we will be happy to share more.

To get the full whitepaper, please click on the link below:

Our Github Repository:

Distributed Specialized Language Model (SLM) Network

Improving the efficiency and accuracy of language models