High-Performance Go Service Architecture for Millions of Connections at Baidu

Introduction

In the mobile internet era, the demand for real-time and interactive services has surged, making long connection services essential for applications. Unlike short connections, which follow a request-response model, long connections keep a network data channel open between the application and the server for continuous, full-duplex data transmission, allowing the server to push data to users in real-time.

Long connection services must achieve low latency, high concurrency, and high stability, which can be challenging and costly if each business maintains its own service. Therefore, the unified long connection project aims to provide a comprehensive solution, offering secure, high-concurrency, low-latency, easy-to-integrate, and cost-effective long connection services for various businesses.

Background

Unified Long Connection Service

The primary goal of the unified long connection service is to offer businesses a secure, high-concurrency, low-latency, easily integrable, and cost-effective long connection system. Key objectives include:

  • Supporting major Baidu app scenarios such as live streaming, messaging, PUSH, and cloud control with secure long connection capabilities.
  • Ensuring high concurrency, stability, and low latency, so that the system stays technically professional and state of the art.
  • Letting multiple businesses reuse a single long connection, reducing the cost and burden of establishing and maintaining connections.
  • Providing a straightforward integration process with clear external interfaces for quick business integration.

Challenges

To build a long connection service that meets business needs, the unified long connection service faces several challenges during its design, development, and maintenance. These challenges primarily fall into two categories:

Functionality Implementation

The main challenge in designing a long connection service is defining clear boundaries between the unified service and individual business integrations. Unlike dedicated services for specific businesses, the unified service must support multiple businesses sharing a single long connection. This requires accommodating various business requirements and scenarios while avoiding excessive business logic in the unified service to ensure scalability and future development.

Typical business requirements for long connection services include:

  • Establishing, maintaining, and managing connections.
  • Forwarding upstream requests.
  • Pushing downstream data.

During data transmission, the service must support different data protocols and push models depending on the business type:

  • Messaging: Private messages and small group chats (500-1000 members),
    primarily using unicast and batch unicast push modes with varying push frequency and concurrency.
  • Live Streaming: Multicast to millions of viewers with high push frequency.
  • Cloud Control: Sending messages to fixed groups in batch unicast mode with lower push frequency.
  • PUSH Notifications: Sending messages to fixed groups in batch unicast mode with lower push frequency.

Business         Push scenario              Push UPS     Frequency
Messaging        Unicast / batch unicast    10K level    High
Live streaming   Group cast                 10M level    High
Cloud control    Batch unicast              1M level     Low
PUSH             Batch unicast              1M level     Low

Consequently, the unified long connection service must provide the following capabilities:

  1. Connection establishment, maintenance, and management.
  2. Upstream and downstream data forwarding, accommodating different business data protocols.
  3. Downstream push, supporting unicast, batch unicast, and broadcast.

Performance Optimization

The unified long connection service must achieve high concurrency, high availability, and high stability to serve Baidu’s apps. Specific performance aspects include:

Metric                   Standard      Description
Concurrent connections   10M level     Horizontal scaling
Upstream QPS             1M level      Horizontal scaling
Downstream QPS           10M level     Horizontal scaling
Latency                  10ms level

Connection QPS, latency, and success rate:

Long connections need to be established quickly when the app opens and maintained while the app is active. The service must support thousands of QPS for connection establishment and millions of concurrent online connections, with horizontal scaling capabilities. Connection establishment is fundamental, with success rate and latency being critical.

Upstream request QPS, latency, and success rate:

Once the connection is established, business requests need to be forwarded to the backend, supporting at least tens to hundreds of thousands of QPS, with horizontal scaling.

Downstream request QPS, latency, and success rate:

Depending on the business scenario, downstream requests may involve batch unicast or multicast. Generally, batch unicast should support millions of UPS, and multicast should support tens of millions of UPS, with horizontal scaling.

Overall Architecture

Architecture

The long connection service consists of four main components: the Unified Long Connection SDK, the Control Layer, the Access Layer, and the Route Layer.

Protocol

The vision of the unified long connection is to let multiple services reuse a single long connection. To do so, the service must let different business data protocols coexist on the same connection and must distinguish requests by business when forwarding upstream and downstream data.

This is achieved through a private long connection data protocol, which consists of three parts:

  1. Protocol Header: Includes protocol identification, protocol version, etc.
  2. Common Parameters: Device identification, application identification, business identification, request metadata, etc.
  3. Business Data: Custom business data compatible with different business data protocols.
    The long connection service only handles forwarding and does not involve specific business details.

The approximate format of the protocol is as follows (not an actual protocol):

    Protocol {
        Header {
            Protocol ID (type) = value,
            Protocol Version (type) = value,
        },
        Common Parameters {
            Device ID (type) = value,
            App ID (type) = value,
            Business ID (type) = value,
            Request Metadata (type) = {
                key: value,
                key: value,
                ...
            },
        },
        Business Data {
            Custom Data 1 (type) = value,
            Custom Data 2 (type) = value,
            ...
        }
    }
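
As a rough illustration of the three-part layout above, the following Go sketch frames a message as a fixed header, length-prefixed common parameters, and opaque business data. The wire layout, the constant values, the JSON encoding of the common parameters, and all names (Frame, CommonParams, Encode, Decode) are assumptions made for this example, since the article does not publish the real protocol.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

// Hypothetical constants; the real protocol ID and version are not public.
const (
	protocolID      uint16 = 0xBD01
	protocolVersion uint8  = 1
	headerLen              = 2 + 1 + 4 + 4 // ID, version, two length fields
)

// CommonParams mirrors the "Common Parameters" part of the format.
type CommonParams struct {
	DeviceID   string            `json:"device_id"`
	AppID      string            `json:"app_id"`
	BusinessID uint32            `json:"business_id"`
	Metadata   map[string]string `json:"metadata,omitempty"`
}

// Frame is one message on the shared connection. Data stays opaque to the
// long connection service, which only forwards it.
type Frame struct {
	Common CommonParams
	Data   []byte
}

// Encode serializes a frame as header + common parameters + business data.
func Encode(f Frame) ([]byte, error) {
	common, err := json.Marshal(f.Common)
	if err != nil {
		return nil, err
	}
	buf := make([]byte, headerLen, headerLen+len(common)+len(f.Data))
	binary.BigEndian.PutUint16(buf[0:2], protocolID)
	buf[2] = protocolVersion
	binary.BigEndian.PutUint32(buf[3:7], uint32(len(common)))
	binary.BigEndian.PutUint32(buf[7:11], uint32(len(f.Data)))
	buf = append(buf, common...)
	return append(buf, f.Data...), nil
}

// Decode validates the header and splits the body into its two parts.
func Decode(b []byte) (Frame, error) {
	var f Frame
	if len(b) < headerLen {
		return f, errors.New("short frame")
	}
	if binary.BigEndian.Uint16(b[0:2]) != protocolID {
		return f, errors.New("unknown protocol ID")
	}
	if b[2] != protocolVersion {
		return f, errors.New("unsupported protocol version")
	}
	commonLen := binary.BigEndian.Uint32(b[3:7])
	dataLen := binary.BigEndian.Uint32(b[7:11])
	body := b[headerLen:]
	if uint64(len(body)) != uint64(commonLen)+uint64(dataLen) {
		return f, errors.New("length mismatch")
	}
	if err := json.Unmarshal(body[:commonLen], &f.Common); err != nil {
		return f, err
	}
	f.Data = append([]byte(nil), body[commonLen:]...)
	return f, nil
}

func main() {
	raw, err := Encode(Frame{
		Common: CommonParams{DeviceID: "dev-1", AppID: "app", BusinessID: 7},
		Data:   []byte("business payload"),
	})
	if err != nil {
		panic(err)
	}
	f, err := Decode(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(f.Common.BusinessID, string(f.Data))
}
```

Because the business ID travels in the common parameters, the service can route a frame without ever parsing the business data, which is what keeps business logic out of the shared connection.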

Control Layer

The main functions of the control layer include:

  1. Verify device legitimacy and determine access strategies before connection establishment.
  2. Generate and verify tokens for device authentication.
  3. Assign access points based on client properties.
  4. Manage small-traffic (canary release) control strategies.

Access Layer

The main functions of the access layer include:

  1. Peer Communication: Establishes, maintains, and releases long connections with the SDK.
  2. Connection Management: Manages connections and maps connection IDs to connection information.
  3. Group Management: Manages connection groups and maps group IDs to connection information.
  4. Upstream Forwarding: Forwards business requests to the backend and returns responses to the SDK.
  5. Downstream Pushing: Receives push requests and sends them to the corresponding SDK.

Connection State Management

Long connections require validation of their legitimacy and effectiveness, as well as rapid response to anomalies. To achieve this, Unified Long Connection employs a state machine. This mechanism clearly defines the various states a long connection can assume during its lifecycle, the actions each state can trigger, and the conditions under which transitions occur.

For instance, before transmitting data, a connection must undergo login validation to ensure its legality. Once authenticated, the connection can engage in data transmission. In case of anomalies post-login, such as data format issues or network disruptions, the system triggers connection invalidation and initiates a reconnect process.

The state machine simplifies the development logic for managing long connection states, ensuring clear definitions of each state and transition condition. This approach helps prevent situations where connections cannot recover due to unknown reasons.
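
A table-driven state machine along these lines might be sketched as follows. The state and event names are illustrative assumptions; the article only states that login must succeed before data flows and that anomalies invalidate the connection and lead to a reconnect.

```go
package main

import "fmt"

// State is a lifecycle state of one long connection (names are hypothetical).
type State int

const (
	StateConnected State = iota // transport established, login pending
	StateActive                 // login validated, data may flow
	StateClosed                 // invalidated; the client is expected to reconnect
)

// Event is something that happens on a connection.
type Event int

const (
	EventLoginOK Event = iota
	EventLoginFailed
	EventData
	EventError // e.g. malformed data or a network disruption
)

// transitions lists every allowed (state, event) pair and its target state.
// Anything not listed is rejected, so no connection drifts into an undefined state.
var transitions = map[State]map[Event]State{
	StateConnected: {
		EventLoginOK:     StateActive,
		EventLoginFailed: StateClosed,
		EventData:        StateClosed, // data before login is not legal
		EventError:       StateClosed,
	},
	StateActive: {
		EventData:  StateActive,
		EventError: StateClosed,
	},
}

// Next returns the state that follows s when event e occurs.
func Next(s State, e Event) (State, error) {
	if next, ok := transitions[s][e]; ok {
		return next, nil
	}
	return s, fmt.Errorf("event %d not allowed in state %d", e, s)
}

func main() {
	s := StateConnected
	for _, e := range []Event{EventLoginOK, EventData, EventError} {
		next, err := Next(s, e)
		if err != nil {
			fmt.Println("rejected:", err)
			break
		}
		fmt.Printf("state %d --event %d--> state %d\n", s, e, next)
		s = next
	}
}
```

Keeping all transitions in one table makes it easy to review which events each state accepts and to add a new state without touching the handling code.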

Multi-protocol support

Long connections rely on protocols like TCP, TLS, QUIC, and WebSocket. Different scenarios use different protocols; for example, native app (NA) clients use TCP and TLS, while mini programs and web clients use WebSocket. To adapt to these scenarios and improve connection quality, we support multiple protocols by splitting the access layer into two layers:

  • Connection Layer: Manages specific protocols (e.g., TLS, WebSocket, QUIC) and provides a unified data interface for the session layer. New protocols are adapted here without affecting session logic.
  • Session Layer: Maintains long connection business logic (e.g., request forwarding, downstream pushing) and interacts with the connection layer, unaware of specific protocol details.

Clients select protocols and access points based on their conditions (e.g., client type, network type, device quality).

Advantages:

  1. Isolates business logic from protocol details, simplifying support for multiple protocols.
  2. Clients choose protocols based on conditions, improving connection quality.
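
The split between the two layers can be pictured as a small Go interface. The Transport interface and the Session type below are hypothetical names for this sketch, and the in-memory transport merely stands in for a real TCP, TLS, WebSocket, or QUIC adapter.

```go
package main

import (
	"errors"
	"fmt"
)

// Transport is the unified data interface the connection layer would expose.
// Each protocol adapter (TCP, TLS, WebSocket, QUIC) would implement it.
type Transport interface {
	ReadMessage() ([]byte, error)
	WriteMessage(msg []byte) error
	Close() error
}

// memTransport is an in-memory stand-in for a real protocol adapter.
type memTransport struct {
	in  [][]byte // messages waiting to be read
	out [][]byte // messages written by the session
}

func (m *memTransport) ReadMessage() ([]byte, error) {
	if len(m.in) == 0 {
		return nil, errors.New("connection closed")
	}
	msg := m.in[0]
	m.in = m.in[1:]
	return msg, nil
}

func (m *memTransport) WriteMessage(msg []byte) error {
	m.out = append(m.out, msg)
	return nil
}

func (m *memTransport) Close() error { return nil }

// Session holds the business-facing logic and depends only on Transport,
// so it never sees protocol details.
type Session struct {
	t      Transport
	handle func(req []byte) []byte
}

// Serve answers requests until the transport fails and returns how many
// requests were served.
func (s *Session) Serve() int {
	served := 0
	defer s.t.Close()
	for {
		req, err := s.t.ReadMessage()
		if err != nil {
			return served
		}
		if err := s.t.WriteMessage(s.handle(req)); err != nil {
			return served
		}
		served++
	}
}

func main() {
	tr := &memTransport{in: [][]byte{[]byte("ping"), []byte("hello")}}
	s := &Session{t: tr, handle: func(req []byte) []byte {
		return append([]byte("ack:"), req...)
	}}
	fmt.Println(s.Serve(), string(tr.out[0]), string(tr.out[1]))
}
```

Adding a new protocol then means writing one more Transport implementation, while Session stays unchanged.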

Upstream

Request Forwarding

After the access layer identifies which business the data belongs to, it forwards the data to the business server via RPC and sends the server’s response back to the client. Along with the business request data, it passes the long connection common parameters. If needed, the access layer can also notify the business server in real time about connection status changes, such as disconnections, so that the server can take appropriate action.
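
One way to picture this forwarding step is a dispatcher keyed by business ID. The sketch below replaces the real RPC call with a plain Go interface; Backend, Forwarder, and the echo backend are hypothetical, and notifying every registered backend on disconnect is a simplification.

```go
package main

import "fmt"

// Request carries the business data plus a subset of the common parameters.
type Request struct {
	DeviceID   string
	BusinessID uint32
	Data       []byte // opaque business payload
}

// Backend stands in for a business server reached via RPC.
type Backend interface {
	Handle(req Request) ([]byte, error)
	OnDisconnect(deviceID string)
}

// Forwarder routes upstream requests to the backend of the owning business.
type Forwarder struct {
	backends map[uint32]Backend
}

func NewForwarder() *Forwarder {
	return &Forwarder{backends: make(map[uint32]Backend)}
}

// Register associates a business ID with its backend.
func (f *Forwarder) Register(businessID uint32, b Backend) {
	f.backends[businessID] = b
}

// Forward sends the request to its business backend and returns the response
// that should go back to the client.
func (f *Forwarder) Forward(req Request) ([]byte, error) {
	b, ok := f.backends[req.BusinessID]
	if !ok {
		return nil, fmt.Errorf("no backend for business %d", req.BusinessID)
	}
	return b.Handle(req)
}

// NotifyDisconnect tells every backend that a device's connection is gone.
func (f *Forwarder) NotifyDisconnect(deviceID string) {
	for _, b := range f.backends {
		b.OnDisconnect(deviceID)
	}
}

// echoBackend is a toy business server used only for the demo.
type echoBackend struct{ gone []string }

func (e *echoBackend) Handle(req Request) ([]byte, error) {
	return append([]byte(req.DeviceID+":"), req.Data...), nil
}

func (e *echoBackend) OnDisconnect(deviceID string) {
	e.gone = append(e.gone, deviceID)
}

func main() {
	fw := NewForwarder()
	im := &echoBackend{}
	fw.Register(1, im)

	resp, err := fw.Forward(Request{DeviceID: "dev-1", BusinessID: 1, Data: []byte("hi")})
	fmt.Println(string(resp), err)

	_, err = fw.Forward(Request{DeviceID: "dev-1", BusinessID: 9})
	fmt.Println(err)

	fw.NotifyDisconnect("dev-1")
	fmt.Println(im.gone)
}
```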

Downstream

There are two main downstream push types:

  • Batch unicast: Pushes messages to specific devices by mapping device IDs to connection information.
  • Group cast: Sends the same message to multiple users by managing connection groups for efficient distribution.

Unicast/Batch Unicast

Unicast Push: The server pushes a message to a specific device by determining its connection instance and connection ID. This involves mapping the device ID to connection information (instance IP + connection ID). The main task is to identify the device ID for the target user:

  • Device-oriented scenarios: Push directly via an interface.
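
The device ID to connection information mapping described above might be sketched as a small route table. The type and method names are assumptions, and grouping a batch by access-layer instance is one plausible way to prepare a batch unicast, not necessarily what the production system does.

```go
package main

import (
	"fmt"
	"sync"
)

// ConnInfo locates one long connection: the access-layer instance holding it
// and the connection ID inside that instance.
type ConnInfo struct {
	InstanceIP string
	ConnID     uint64
}

// RouteTable maps device IDs to connection information.
type RouteTable struct {
	mu     sync.RWMutex
	routes map[string]ConnInfo
}

func NewRouteTable() *RouteTable {
	return &RouteTable{routes: make(map[string]ConnInfo)}
}

// Register records where a device is connected (called after login).
func (t *RouteTable) Register(deviceID string, info ConnInfo) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.routes[deviceID] = info
}

// Unregister removes a device when its connection closes.
func (t *RouteTable) Unregister(deviceID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.routes, deviceID)
}

// PlanBatch groups the target connection IDs by instance, so that each
// instance can receive a single push request. Devices without a live
// connection are returned separately.
func (t *RouteTable) PlanBatch(deviceIDs []string) (map[string][]uint64, []string) {
	byInstance := make(map[string][]uint64)
	var offline []string
	t.mu.RLock()
	defer t.mu.RUnlock()
	for _, id := range deviceIDs {
		info, ok := t.routes[id]
		if !ok {
			offline = append(offline, id)
			continue
		}
		byInstance[info.InstanceIP] = append(byInstance[info.InstanceIP], info.ConnID)
	}
	return byInstance, offline
}

func main() {
	rt := NewRouteTable()
	rt.Register("dev-1", ConnInfo{InstanceIP: "10.0.0.1", ConnID: 11})
	rt.Register("dev-2", ConnInfo{InstanceIP: "10.0.0.2", ConnID: 21})
	rt.Register("dev-3", ConnInfo{InstanceIP: "10.0.0.1", ConnID: 12})

	plan, offline := rt.PlanBatch([]string{"dev-1", "dev-2", "dev-3", "dev-4"})
	fmt.Println(plan, offline)
}
```

A single unicast is the same lookup with one device ID, so both push modes can share the table.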

Group Cast

Group Cast Push: Used for scenarios like live streaming, where the same message is sent to many users. The routing layer maintains connection groups, mapping group IDs to their connections. Businesses control group creation, and clients join or leave groups. Once a group is established, the long connection service distributes each message to all connections in the group.

To use connection groups, the business needs to:

  1. Create connection groups.
  2. Manage client joins and leaves.
  3. Push messages using connection group IDs.
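
The three steps above can be sketched with an in-memory group registry. The Groups type and its methods are hypothetical, and the send callback stands in for writing to a real connection.

```go
package main

import (
	"fmt"
	"sync"
)

// Groups maps a connection group ID to the set of connection IDs in it.
type Groups struct {
	mu      sync.RWMutex
	members map[string]map[uint64]struct{}
}

func NewGroups() *Groups {
	return &Groups{members: make(map[string]map[uint64]struct{})}
}

// Create registers a group; in the article this is controlled by the business.
func (g *Groups) Create(groupID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, ok := g.members[groupID]; !ok {
		g.members[groupID] = make(map[uint64]struct{})
	}
}

// Join adds a connection to an existing group and reports whether it succeeded.
func (g *Groups) Join(groupID string, connID uint64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	set, ok := g.members[groupID]
	if !ok {
		return false
	}
	set[connID] = struct{}{}
	return true
}

// Leave removes a connection from a group; unknown groups are ignored.
func (g *Groups) Leave(groupID string, connID uint64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.members[groupID], connID)
}

// Push delivers msg to every connection in the group through send and
// returns the number of deliveries attempted.
func (g *Groups) Push(groupID string, msg []byte, send func(connID uint64, msg []byte)) int {
	g.mu.RLock()
	defer g.mu.RUnlock()
	for connID := range g.members[groupID] {
		send(connID, msg)
	}
	return len(g.members[groupID])
}

func main() {
	g := NewGroups()
	g.Create("live-room-42")
	g.Join("live-room-42", 1)
	g.Join("live-room-42", 2)
	g.Leave("live-room-42", 1)

	n := g.Push("live-room-42", []byte("gift!"), func(connID uint64, msg []byte) {
		fmt.Printf("conn %d <- %s\n", connID, msg)
	})
	fmt.Println("delivered:", n)
}
```

The business addresses a push by group ID only, so it never needs to track which connections are currently in the group.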

Summary

The unified long connection system now supports tens of millions of concurrent connections and handles batch unicast and multicast pushes at millions of UPS, with real-time scaling. Since launch, it has remained stable, successfully managing high-concurrency events without impacting other services. Overall, the project has met service quality expectations.

Key Insights:

  1. Requirements Analysis: Clearly define boundaries between requirements and business logic to maintain service stability.
  2. Technical Design: Opt for simple solutions that meet clear requirements. Focus on stability and high performance rather than complex solutions.
  3. Operations: Balance single-instance performance with maintenance needs. Multiple smaller instances often offer better stability and resource efficiency.

References

千万级高性能长连接Go服务架构实践 (Architecture Practice of a High-Performance Go Long Connection Service at the Ten-Million Scale) by glstr

Many thanks to the original author, glstr.