Token Management
Release Time 2025-03-03
Scene Description
The AI gateway can track the number of tokens consumed by LLMs and impose restrictions when consumers exceed their limits, making it easier to manage user quotas for AI applications and providing the data needed for token usage analysis.
The token management scenario builds on the consumer authentication, token rate limiting, and token quota plugins, and integrates observability so that token consumption becomes a quantifiable, manageable, and optimizable service unit. Combined with custom policies, it keeps the service stable, secure, and fair under high concurrency.
Deploy Higress AI Gateway
This guide is based on Docker deployment. If you need another deployment method (such as Kubernetes or Helm), please refer to Quick Start.
Execute the following command:
curl -sS https://higress.cn/ai-gateway/install.sh | bash
Follow the prompts to enter the Aliyun Dashscope API-KEY (or another provider's API-KEY); you can also press Enter to skip and modify it later in the console.
The default HTTP service port is 8080, the HTTPS service port is 8443, and the console service port is 8001. If you need to use other ports, download the deployment script with wget https://higress.cn/ai-gateway/install.sh, modify DEFAULT_GATEWAY_HTTP_PORT/DEFAULT_GATEWAY_HTTPS_PORT/DEFAULT_CONSOLE_PORT, and then execute the script with bash.
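For example, a minimal sketch of changing the HTTP port (the sed patterns assume the script assigns these variables on their own lines; 9080 and 9001 are illustrative values):

wget https://higress.cn/ai-gateway/install.sh
# Assumed variable assignments in the script; adjust the patterns if they differ
sed -i 's/^DEFAULT_GATEWAY_HTTP_PORT=.*/DEFAULT_GATEWAY_HTTP_PORT=9080/' install.sh
sed -i 's/^DEFAULT_CONSOLE_PORT=.*/DEFAULT_CONSOLE_PORT=9001/' install.sh
bash install.sh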
After the deployment is completed, the following output will be displayed.
Console Configuration
Access the Higress console via a browser at http://localhost:8001/. The first login requires setting up an administrator account and password.
In the LLM Provider Management section, you can configure API-KEYs for the integrated providers. Currently integrated providers include Alibaba Cloud, DeepSeek, Azure OpenAI, OpenAI, DouBao, etc. Here we configure a multi-model proxy for Tongyi Qwen; this can be skipped if it was already configured in the previous step.
Configure Consumers
In the Consumer Management section of the console, create consumers for the current gateway to manage quotas and send requests.
Click Create Consumer and, based on Key Auth, create three consumers named aliyun-admin, aliyun-user1, and aliyun-user2. Authentication is performed via the x-api-key field in the HTTP header.
Configure Redis Storage Service
A Redis service needs to be created to cache token usage. This example uses Docker to set up a local Redis service for Higress.
Build Redis Service
- Use the following docker command to start a Redis container:
docker run --name my-redis -p 6379:6379 -d redis
- Check the IP address of the my-redis service (see the command sketch after this list):
  - Use docker network ls to get the NETWORK ID of the bridge network.
  - Use docker network inspect <network-id> to check whether the my-redis container is connected to the bridge network.
  - If not, connect it to the network using the docker network connect bridge my-redis command.
  - Get the IP address of the my-redis service.
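A minimal command sketch for these steps (the inspect format string is standard Docker; the ping check should print PONG):

docker ps --filter name=my-redis          # confirm the container is running
docker exec -it my-redis redis-cli ping   # expect: PONG
docker network ls                         # note the NETWORK ID of the bridge network
docker network inspect bridge             # my-redis should appear under "Containers"
docker network connect bridge my-redis    # only needed if it is not connected yet
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' my-redis   # prints the IP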
Configure Redis Service Source
Create a service source in the console's Service Sources section and fill in the corresponding fields:
- Type: Static Addresses
- Service Address: the IP address of my-redis concatenated with the service port (see the example below)
- Service Protocol: HTTP
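For example, assuming the my-redis IP obtained above is 172.17.0.2 (illustrative) and Redis listens on its default port 6379, and naming the source local-redis so that it matches the service_name (local-redis.static) referenced in the plugin configurations below:

- Name: local-redis
- Type: Static Addresses
- Service Address: 172.17.0.2:6379
- Service Protocol: HTTP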
Configure AI Route Strategy
Consumer Authentication Configuration
In the AI Route Config section, click Edit on the aliyun route to configure its consumers.
Enable request authentication and add the consumers created earlier.
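Once authentication is enabled, a request without a valid x-api-key should be denied. A quick check (the 401 status is an assumption; key-auth plugins typically reject unauthenticated requests with 401):

# No x-api-key header, so the gateway should refuse the request
curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:8080/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen-max","messages":[{"role":"user","content":"hi"}]}'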
Token Quota Configuration
In the AI Route Config section, click Edit and configure AI Quota for aliyun.
Fill in the following fields as a reference in the AI Quota configuration:

redis_key_prefix: 'chat_quota:'
admin_consumer: aliyun-admin
admin_path: /quota
redis:
  service_name: local-redis.static
  service_port: 80
  timeout: 2000
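The plugin keeps each consumer's remaining quota in Redis. Assuming the key layout is simply the configured redis_key_prefix followed by the consumer name (an assumption for illustration), you can inspect a consumer's quota directly:

# Key name assumed: redis_key_prefix + consumer name
docker exec -it my-redis redis-cli GET 'chat_quota:aliyun-user1'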
Token Rate Limiting Configuration
In the AI Route Config section, click Edit and configure AI Token Rate Limit for aliyun.
Fill in the following fields as a reference in the AI Token Rate Limit configuration:

rule_items:
- limit_by_per_header: x-api-key
  limit_keys:
  - key: "*"
    token_per_minute: 5 # Limit to 5 tokens per minute
rule_name: "default_rule"
redis:
  service_name: local-redis.static
  service_port: 80
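With such a small token_per_minute budget, a second request in the same minute should be rejected once the first response has consumed it. A quick check (replace the x-api-key value with aliyun-user1's credential; the 429 status for the rate-limited response is an assumption):

# Send two requests back to back; the second should be rate limited
for i in 1 2; do
  curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:8080/v1/chat/completions' \
    -H 'x-api-key: xxxxxxxxxxxx' \
    -H 'Content-Type: application/json' \
    -d '{"model":"qwen-max","messages":[{"role":"user","content":"Who are you?"}]}'
done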
Debugging
Open the system’s built-in command line and send a request using the following command (if the HTTP service is not deployed on port 8080, modify it to the corresponding port):
# Query quota, x-api-key is the credential for aliyun-admin
curl 'http://localhost:8080/v1/chat/completions/quota?consumer=aliyun-user1' \
  -H 'x-api-key:xxxxxxxxxxxx' \
  -H 'x-higress-llm-model: qwen-max'
# Refresh quota, x-api-key is the credential for aliyun-admin
curl 'http://localhost:8080/v1/chat/completions/quota/refresh' \
  -d 'consumer=aliyun-user1&quota=100' \
  -H 'x-api-key:xxxxxxxxxxxx' \
  -H 'x-higress-llm-model: qwen-max'
# Increase quota, x-api-key is the credential for aliyun-admin
curl 'http://localhost:8080/v1/chat/completions/quota/delta' \
  -d 'consumer=aliyun-user1&value=100' \
  -H 'x-api-key:xxxxxxxxxxxx' \
  -H 'x-higress-llm-model: qwen-max'
# Request, x-api-key is the credential for aliyun-user1
curl 'http://localhost:8080/v1/chat/completions' \
  -H 'x-api-key:xxxxxxxxxxxx' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-max",
    "messages": [
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }'
Sample response:
Observability
In the AI Dashboard, you can observe AI requests. Observability metrics include input/output tokens per second, token usage by each provider/model, token usage per consumer, etc.
If you encounter any issues during deployment, feel free to leave your information in a Higress GitHub Issue.
If you are interested in future updates of Higress or wish to provide feedback, you are welcome to star the Higress GitHub Repo.