
Improving Real-Time Inference with Anchor Tokens

by Anchoring, October 10th, 2024

Too Long; Didn't Read

This section discusses how training models to compress information into anchor tokens can optimize inference by reducing the keys/values caches. A new algorithm is introduced to manage the cache efficiently during prefix processing and token generation in real time.

Authors:

(1) Jianhui Pang, University of Macau; work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab ([email protected]);

(2) Fanghua Ye, University College London; work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab ([email protected]);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab, corresponding author.

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks


3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References


A More Experimental Results

B Data Settings

3.3 Anchor-based Inference

By training the model to compress information into the anchor token of a natural language sequence, we can optimize the inference process by modifying the keys/values caching mechanism. Specifically, during inference, upon encountering an anchor token that condenses the comprehensive semantic information of preceding tokens in the current sequence, the model can reduce the keys/values caches by deleting the caches of non-anchor tokens within that sequence.
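To make this concrete, the sketch below shows what such a cache reduction could look like at the tensor level. The function name reduce_kv_cache, the tensor layout, and the rule of retaining tokens that follow the last anchor (since they are not yet summarized) are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the cache-reduction idea (shapes and names are assumptions):
# given the positions of anchor tokens, keep only their keys/values plus any
# trailing tokens not yet summarized, and discard the rest of the cache.
import torch

def reduce_kv_cache(keys, values, anchor_positions, seq_len):
    """keys/values: (num_heads, seq_len, head_dim); anchor_positions: list[int]."""
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[anchor_positions] = True
    # Tokens after the last anchor are not yet condensed into any anchor, so keep them.
    if anchor_positions:
        keep[max(anchor_positions) + 1:] = True
    return keys[:, keep, :], values[:, keep, :]

# Example: 8 cached tokens with anchors at positions 3 and 6 -> 3 entries survive.
k = torch.randn(4, 8, 64)
v = torch.randn(4, 8, 64)
k_small, v_small = reduce_kv_cache(k, v, [3, 6], seq_len=8)
print(k_small.shape)  # torch.Size([4, 3, 64])
```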


We introduce the inference method in Algorithm 1. The function "REDUCTION" in Line 1 removes keys/values caches when the model processes prefix texts (Line 10) or generates an anchor token while predicting the next token (Line 16). This approach reduces the keys/values caches for both prefix tokens and generated outputs during real-time inference.
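As a rough, self-contained illustration of where this reduction fires, the toy loop below mimics the cache bookkeeping with plain Python lists: entries accumulate token by token, and each time an anchor token appears, the non-anchor entries it summarizes are evicted. ANCHOR_ID, the tuple layout, and the reduction/process_sequence names are hypothetical stand-ins, not the paper's code.

```python
# Toy simulation of the cache bookkeeping (all names and layouts are assumed):
# the cache is a list of (token_id, key, value) entries; whenever an anchor
# token arrives, the non-anchor entries that precede it are dropped.

ANCHOR_ID = 3  # hypothetical anchor-token id

def reduction(cache, anchor_pos):
    """Evict non-anchor entries at or before anchor_pos; keep anchors and later tokens."""
    return [e for i, e in enumerate(cache) if i > anchor_pos or e[0] == ANCHOR_ID]

def process_sequence(token_ids):
    cache = []
    for t in token_ids:
        cache.append((t, f"k_{t}", f"v_{t}"))  # stand-ins for real keys/values
        if t == ANCHOR_ID:                     # anchor encountered: shrink the cache
            cache = reduction(cache, len(cache) - 1)
    return cache

print(process_sequence([5, 7, 3, 9, 2, 3, 4]))
# -> only the two anchor entries and the trailing token 4 remain cached
```

The same bookkeeping applies during generation: after the prefix is processed the cache is reduced once, and it is reduced again each time the model emits an anchor token, which is what keeps the cache small in real time.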


This paper is available under a CC BY 4.0 DEED license.

