Introduction

In today’s digital world, real-time messaging is crucial for communication. Two popular ways of messaging are group chats and live stream messages. Both serve to connect people instantly but in different contexts and styles. This article will explore the technical aspects of these two systems, highlighting their similarities and differences.

Basic Definitions and Use Cases

Group Chats:

Group chats are online spaces where multiple people can communicate in real-time. These are often used for team collaboration, social interactions, or family discussions. In group chats, messages are stored, and participants can review the conversation history at any time. Apps like Slack, Discord, and WhatsApp are common examples.

Use Cases for Group Chats:
Team Collaboration: Team members discuss projects, share files, and make decisions together.
Social Groups: Friends and family members stay in touch, share updates, and plan events.
Communities: People with shared interests engage in discussions, exchange ideas, and offer support.

Live Stream Messages:

Live stream messages are real-time comments or interactions that happen during a live video broadcast. These messages appear instantly on the screen, allowing the audience to interact with the streamer and each other. Platforms like Twitch, YouTube Live, and Facebook Live provide these features.

Use Cases for Live Stream Messages:
Audience Interaction: Viewers ask questions, share thoughts, and react to the content in real-time.
Events and Webinars: Participants engage with speakers, participate in polls, and join Q&A sessions.
Gaming Streams: Gamers interact with their audience, get live feedback, and build a community around their content.

This comparison will delve deeper into the architecture, performance, features, and security aspects of group chats and live stream messages, helping you understand their unique characteristics and applications.

Basic Comparison

                     Group Chat                          Live Stream Messages
Participants         1K level                            1M level
Relationship chain   Present                             Absent
Member churn         Low                                 High
Offline messages     A focus                             Not a focus
Session duration     Long                                Short
Security             End-to-end encryption (optional)    No encryption

Based on the above table, there are two main issues for live stream messaging:

  1. User Maintenance:

    • Tens of thousands of users join and leave the live stream room every second.
    • A single live stream can have millions of users online simultaneously.
    • The cumulative number of users entering a live stream room can reach tens of millions.
  2. Message Delivery:

    • With millions of users online, there is a large volume of incoming and outgoing messages.
    • Ensuring the reliable delivery of messages, such as gifts and co-streaming requests.

Since the first issue is not difficult to solve, the following discussion will focus mainly on the second issue.

Architecture and Infrastructure

Group Chat

Typical architecture

Message Diffusion

In group chats, there are generally two modes: read diffusion and write diffusion.

  • Read Diffusion: The group message is stored only once, and all users in the group pull the message from the group’s public mailbox.

  • Write Diffusion: When a message is sent, it is dispatched into each person’s mailbox.

However, strictly speaking, even with read diffusion the message body can be stored only once, but each user’s fetch_msgid, ack_msgid, read_msgid, begin_msgid, etc., differ. A per-user dispatch step, as in write diffusion, is therefore unavoidable. Thus, this article discusses the write diffusion model directly; a minimal sketch of it follows.
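
A minimal Go sketch of write diffusion, under the assumption of an in-memory mailbox per user (the Mailbox and dispatch names are illustrative, not from any real system): one send is copied into every member’s private mailbox, and each mailbox keeps its own cursors.

package main

import "fmt"

// Message is one chat message; Seq is the per-mailbox sequence number.
type Message struct {
	Seq  int64
	From string
	Body string
}

// Mailbox is a per-user queue with per-user cursors (the ack_msgid /
// read_msgid analogues), which is why some per-user dispatch is
// unavoidable even under read diffusion.
type Mailbox struct {
	Msgs    []Message
	ReadSeq int64
}

// dispatch implements write diffusion: one send fans out into every
// group member's mailbox.
func dispatch(group map[string]*Mailbox, from, body string) {
	for _, box := range group {
		seq := int64(len(box.Msgs)) + 1
		box.Msgs = append(box.Msgs, Message{Seq: seq, From: from, Body: body})
	}
}

func main() {
	group := map[string]*Mailbox{"alice": {}, "bob": {}, "carol": {}}
	dispatch(group, "alice", "hello group")
	fmt.Println(len(group["bob"].Msgs)) // 1: the message was copied into bob's mailbox
}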

overall

The group experiences two types of diffusion: 1 group -> m users and 1 user -> n devices.

Where m depends on the number of group members, and n depends on the number of devices per user. Typically, m does not exceed 1K, and n does not exceed 5.

If 1000 groups send messages simultaneously, the fan-out scale is roughly 1000 * 1000 * 5 = 5 million deliveries. As this scale continues to increase, the splitting process puts enormous pressure on the entire system and consumes ever more resources.

Message storage and retrieval

overall
For group chat messages, a combination of push and pull methods is usually used to prevent message loss. If the server directly pushes messages to the client and the notification is lost, the message might also be lost. Therefore, the common approach is to use a combination of push and pull. When the server receives a message, it first stores the message and then notifies the client that a new message has arrived. When the client receives the notification, it pulls the new message. This way, even if a notification for a particular message is lost, the client can still retrieve the missing message the next time it pulls messages.
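
A minimal Go sketch of the notify + pull idea (types and names are illustrative): the client keeps the ID of the last message it has seen, and a notification carries no payload, it merely triggers a pull of everything newer, so a lost notification is repaired by the next pull.

package main

import "fmt"

type Msg struct {
	ID   int64
	Body string
}

// Server side: an append-only message log per conversation.
type Server struct{ log []Msg }

// PullSince returns every message newer than sinceID.
func (s *Server) PullSince(sinceID int64) []Msg {
	var out []Msg
	for _, m := range s.log {
		if m.ID > sinceID {
			out = append(out, m)
		}
	}
	return out
}

// Client side: a cursor plus a pull on every notification.
type Client struct {
	lastID int64
	srv    *Server
}

func (c *Client) OnNotify() {
	for _, m := range c.srv.PullSince(c.lastID) {
		fmt.Println("got:", m.Body)
		c.lastID = m.ID
	}
}

func main() {
	srv := &Server{log: []Msg{{1, "a"}, {2, "b"}}}
	cli := &Client{srv: srv}
	// Even if the notification for message 1 was lost, the notification
	// for message 2 delivers both, because the pull is cursor-based.
	cli.OnNotify()
}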

Security and Privacy

Compared to the C2C scenario, where end-to-end encryption is relatively easy to implement, the cost of achieving end-to-end encryption in group chats is significantly higher. However, there are feasible solutions, such as those used by iMessage and Signal. Implementing end-to-end encryption in group chats is complex and will not be discussed in this article.

Live Stream Messages

Typical architecture

overall
Live stream messages mainly go through two amplification stages: 1 message -> M instances, then 1 instance -> N connections, i.e. an M * N fan-out.

  • M depends on the number of LCS (Long Connection Service) instances that hold viewers of the live stream room.
  • N depends on the number of connections joined to a particular multicast on a single instance.

M can reach hundreds of instances, and since the membership is maintained in memory, N can easily reach a scale of 100,000.

Therefore, with 100 instances, each holding 100,000 connections joined to the same room, a single room can support 100 * 100,000 = 10 million simultaneous connections.

route layer

The main functions of the routing layer are (a sketch of functions 2 and 4 follows the list):

  1. To find, based on the mcastid, which long connection instances have connections joined to the room.
  2. To aggregate messages so that each message does not need to be delivered individually.
  3. To implement rate limiting to protect clients: if processing capacity is exceeded, messages can be throttled here.
  4. To prioritize messages, ensuring that high-priority messages are pushed first and are not discarded.
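
A minimal Go sketch of functions 2 and 4 under simplified assumptions (a single room, an in-process channel standing in for the RPC stream): messages are aggregated into batches flushed on a timer, excess low-priority chat is throttled, and high-priority messages are never dropped.

package main

import (
	"fmt"
	"time"
)

type Msg struct {
	Priority int // 0 = ordinary chat; higher = gifts, co-stream requests
	Body     string
}

// Batcher aggregates room messages for a short window so each LCS
// instance receives one batched notification instead of one call per
// message; excess low-priority chat is dropped, high priority never is.
type Batcher struct {
	in         chan Msg
	maxLow     int           // per-window cap on low-priority messages
	flushEvery time.Duration // aggregation window
}

func (b *Batcher) Run(send func(batch []Msg)) {
	ticker := time.NewTicker(b.flushEvery)
	defer ticker.Stop()
	var batch []Msg
	low := 0
	for {
		select {
		case m := <-b.in:
			if m.Priority == 0 {
				if low >= b.maxLow {
					continue // rate limit: drop excess chat
				}
				low++
			}
			batch = append(batch, m)
		case <-ticker.C:
			if len(batch) > 0 {
				send(batch)
				batch, low = nil, 0
			}
		}
	}
}

func main() {
	b := &Batcher{in: make(chan Msg, 1024), maxLow: 100, flushEvery: 100 * time.Millisecond}
	go b.Run(func(batch []Msg) { fmt.Println("flushed", len(batch), "messages") })
	b.in <- Msg{Priority: 1, Body: "gift"}
	b.in <- Msg{Priority: 0, Body: "hello"}
	time.Sleep(250 * time.Millisecond)
}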

lcs layer

The main functions of the connection layer are (a sketch follows the list):

  1. To maintain the mapping from roomID to connections.
  2. To compress messages, significantly reducing bandwidth usage.
  3. To send messages to the client.
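
A minimal sketch of the connection layer's room map in Go (names are illustrative; compression and the actual socket writes are elided): the roomID -> connections map is held in memory, which is what lets a single instance hold on the order of 100,000 members of one room.

package main

import (
	"fmt"
	"sync"
)

// Conn stands in for one long connection; Send would write to the socket.
type Conn struct{ id int64 }

func (c *Conn) Send(payload []byte) { fmt.Printf("conn %d <- %d bytes\n", c.id, len(payload)) }

// LCS keeps the in-memory roomID -> connections map.
type LCS struct {
	mu    sync.RWMutex
	rooms map[string]map[int64]*Conn
}

func (l *LCS) Join(room string, c *Conn) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.rooms[room] == nil {
		l.rooms[room] = make(map[int64]*Conn)
	}
	l.rooms[room][c.id] = c
}

func (l *LCS) Leave(room string, c *Conn) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.rooms[room], c.id)
}

// Broadcast writes one (already compressed) payload to every member.
func (l *LCS) Broadcast(room string, payload []byte) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	for _, c := range l.rooms[room] {
		c.Send(payload)
	}
}

func main() {
	lcs := &LCS{rooms: make(map[string]map[int64]*Conn)}
	lcs.Join("room42", &Conn{id: 1})
	lcs.Join("room42", &Conn{id: 2})
	lcs.Broadcast("room42", []byte("hello"))
}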

push mode or pull mode

For group messages, we have to guarantee that messages won’t be lost. Therefore, we use a notify + pull mechanism. However, during live stream messages, we can tolerate occasional message loss to ensure messages are more timely. Thus, we typically use a push mode, directly pushing messages to the client.

gift message

For gift messages and other important messages, which are especially significant to the streamer, we can raise their priority. However, in push mode, a disconnection can still cause message loss. Therefore, we established a separate data stream for gift messages, using the notify + pull mechanism described earlier.

Conclusion

Group chats and live stream messages each serve unique purposes in real-time communication. Group chats are ideal for sustained interactions with features like message history and end-to-end encryption, making them suitable for team collaboration and social groups. Live stream messages, however, excel in large-scale, real-time interactions, perfect for events, gaming, and broadcasts.

Technically, group chats use read and write diffusion models for efficient message storage and retrieval, ensuring reliability and security. In contrast, live stream messages prioritize immediacy and scalability, using a push-based system to handle high user volumes and rapid turnover.

Choosing between the two depends on the communication needs, with group chats providing robust, secure interactions and live stream messages offering dynamic, real-time engagement. Understanding these differences helps in selecting the right tool for optimal performance and user satisfaction.

introduction

A few years ago, while working on a QUIC-related project, I encountered a very peculiar case of UDP packet loss in a Linux environment. It took me several days to pinpoint the issue, so I’ve decided to document and summarize it.

definition of terms

  • QUIC: A UDP-Based Multiplexed and Secure Transport.

  • Connection: A QUIC connection is shared state between a client and a server.

  • Connection ID: An identifier for a QUIC connection.

  • Connection Migration: The use of a connection ID allows connections to survive changes to endpoint addresses (IP address and port), such as those caused by an endpoint migrating to a new network.

  • QUIC client: the QUIC client used in this project.

  • BGW: my company’s layer-4 gateway, a proxy transparent to users.

  • proxy server: a proxy that forwards upstream and downstream packets.

  • QUIC server: our QUIC server that handles QUIC connections.

  • quic-go: a QUIC implementation in pure Go.

background

unicast

The architecture, as depicted in the diagram, is relatively simple.
The issue arises when the number of QUIC connections exceeds a certain limit (typically around 150,000) and persists for some time. This leads to sporadic packet loss, exacerbated by QUIC’s retransmission mechanism, which can escalate into a cascade effect.

During our project, when QUIC was still a draft protocol, we made modifications to it. For instance, because QUIC handshake packets are large (our business server had to send certificates during the handshake), we experimented with having the BGW server handle the first phase of the handshake and certificate delivery, which introduced complications such as sequence-number handling. However, these details are not pertinent to this article. Looking back, such changes were not ideal for seamless business operations.

proxy server

The Connection ID facilitates Connection Migration, allowing the same connection to be identified even if the client’s IP and port change. In a typical setup with multiple backend servers, successful migration requires the proxy to ensure that, despite changes in client IP and port, packets with the same Connection ID are consistently forwarded to the same target machine. The proxy server, implemented in C++ using epoll and operating as a single-threaded program, employs consistent hashing on the Connection ID to forward UDP packets.
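
A minimal Go sketch of that forwarding decision (the real proxy is C++; the backend addresses and replica count here are made up): a consistent-hash ring keyed on the Connection ID, so the same Connection ID always lands on the same QUIC server regardless of the client’s current IP and port.

package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// Ring is a consistent-hash ring: virtual points on a circle, each
// owned by a backend.
type Ring struct {
	points []uint32
	owner  map[uint32]string
}

func NewRing(backends []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, b := range backends {
		for i := 0; i < replicas; i++ {
			h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", b, i)))
			r.points = append(r.points, h)
			r.owner[h] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick returns the backend for a Connection ID: the first ring point
// clockwise from the ID's hash.
func (r *Ring) Pick(connID []byte) string {
	h := crc32.ChecksumIEEE(connID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"10.0.0.1:8443", "10.0.0.2:8443"}, 100)
	cid := []byte{0x5a, 0x11, 0x22, 0x33}
	// Same Connection ID -> same backend, even after client migration.
	fmt.Println(ring.Pick(cid))
}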

problem detection

When the number of connections is low, the connections remain stable. However, during stress testing with a higher number of connections, they start disconnecting after a while, with the primary reason being timeouts. To pinpoint the cause of these disconnections, I conducted the following investigations.

Packet Loss Investigation

  1. Monitoring Key Metrics

    • Observed CPU, memory, disk I/O, and network I/O across all stages.
    • None of these metrics hit performance bottlenecks, but network I/O spiked when connections began to drop and stabilized when only a few connections remained.
    • Preliminary conclusion: possible network packet loss and QUIC retransmissions.
  2. Packet Analysis with tcpdump

    • Attempted to analyze packet loss and retransmissions.
    • Packet capturing is feasible with fewer connections and lower traffic.
    • With over 100,000 long connections, capturing and analyzing packets is inefficient due to high data rates. Hence, this approach was temporarily abandoned.
  3. Intermediate Step Elimination

    • BGW (a core company infrastructure component) was unlikely to be the issue but not ruled out entirely.
    • Established two alternate paths:
      • QUIC Client -> QUIC Server
      • QUIC Client -> QUIC Proxy -> QUIC Server
    • The first path showed no issues, while the second path had problems, indicating a potential issue with the QUIC proxy.
    • Identified possible scenarios:
      1. QUIC client failed to send packets, lost during transmission.
      2. QUIC client sent packets, but QUIC proxy didn’t receive them.
      3. QUIC proxy received packets but didn’t forward them.
      4. QUIC proxy forwarded packets, but QUIC server didn’t receive them.
  4. Log Sampling and Analysis

    • To identify the problematic stage, I added logging throughout the entire path.
    • Logging for all connections caused disk I/O bottlenecks, so I implemented sampling (logging for connections where Connection ID % 10000 == 1).
    • Debug log analysis revealed that the QUIC client sent packets, but the QUIC proxy didn’t receive them.
    • Preliminary conclusion: two potential causes:
      • QUIC client sent packets, but they were lost during transmission.
      • Packets were lost during reception by the QUIC proxy (focus area for further investigation).
  5. Reviewing Monitoring Details

    • Upon revisiting the monitoring data, I noticed a new detail: the network IN on the QUIC proxy machine was significantly higher than the network OUT.
    • This suggests that data packets are reaching the machine where the proxy is located, but the proxy application is not receiving them.

Detailed Cause Identification

To further investigate why the QUIC proxy application is not receiving the packets, I conducted the following steps:
data path

+----------------+       +----------------+
| Network Adapter|       |  Ring Buffer   |
+----------------+       +----------------+
        |                        |
        v                        v
+----------------+       +----------------+
|     Kernel     |       | System Buffer  |
+----------------+       +----------------+
        |                        |
        v                        v
+----------------+       +----------------+
|  User Program  |       | Socket Buffer  |
+----------------+       +----------------+

The sending process wasn’t analyzed because it’s similar to receiving but in reverse, and packet loss is less likely, occurring only when the application’s send rate exceeds the kernel and network card’s processing rate.

  1. Check for Packet Loss at the NIC Level

    • Result: No packet loss detected.
    • Command Used: ethtool -S / ifconfig
  2. Check for UDP Protocol Packet Loss

    • Command Used: netstat -su
    • Result: "packet receive errors" is increasing rapidly, at a rate of 10k per second, and RcvbufErrors is also increasing. The rate of increase for packet receive errors is much higher than for RcvbufErrors.
  3. Check UDP Buffer Size

    • Current Sizes: the system UDP buffer size is 256K; the application UDP buffer size is 2M (a Go sketch of the application-side setting follows this list).
    • Adjustment: even after increasing the system UDP buffer size to 25M, RcvbufErrors continue to grow.
    • Conclusion: the main cause of packet loss appears to be packet receive errors.
  4. Check Firewall Status

    • Result: Firewall is disabled.
  5. Check Application Load

    • Result: CPU, memory, and disk I/O loads are all low.
  6. Check Application Processing Logic

    • Result: The application uses single-threaded synchronous processing with simple logic to forward packets.
    • Potential Issue: The simplicity and single-threaded nature of the processing logic might be the source of the problem.
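
For reference on step 3, this is roughly how an application requests a larger UDP receive buffer, sketched in Go rather than the proxy’s C++ (the port number is arbitrary; the effective size is still capped by the kernel’s net.core.rmem_max sysctl):

package main

import (
	"log"
	"net"
)

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 8443})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Ask the kernel for a 25 MB receive buffer; silently capped by
	// net.core.rmem_max unless the sysctl is raised as well.
	if err := conn.SetReadBuffer(25 * 1024 * 1024); err != nil {
		log.Fatal(err)
	}

	buf := make([]byte, 65535)
	for {
		n, addr, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Println(err)
			continue
		}
		log.Printf("%d bytes from %s", n, addr)
	}
}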

At this point, it can be preliminarily determined that the proxy’s processing capability is insufficient. After carefully reviewing the proxy’s code, it was found that the proxy uses a single-threaded epoll event loop to receive and synchronously forward packets.
With additional logging and statistics, it was observed that the processing time for each packet is approximately 20 to 50 microseconds. Single-threaded synchronous processing can therefore handle only about 20,000 to 50,000 packets per second (1 s / 50 µs = 20,000; 1 s / 20 µs = 50,000). If the incoming packet rate exceeds this capacity, packets are dropped.

verification

The optimal solution would be to refactor the proxy to use multi-threading and asynchronous processing. That would significantly increase the proxy’s throughput, but it requires substantial changes to the code. Therefore, as a quick validation, I simply started multiple proxy processes listening on different ports, and found that the QUIC server could then reliably maintain hundreds of thousands of connections. A sketch of the idea follows.
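
A Go sketch of that validation idea (the real proxy is a C++ program and the ports here are arbitrary; goroutines stand in for separate processes): one independent listener per port, so no single event loop caps the aggregate packet rate.

package main

import (
	"log"
	"net"
)

// serve drains one UDP socket; the forwarding logic is elided, an echo
// stands in for it.
func serve(conn *net.UDPConn) {
	buf := make([]byte, 65535)
	for {
		n, addr, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Println(err)
			continue
		}
		conn.WriteToUDP(buf[:n], addr)
	}
}

func main() {
	// One listener per port, mirroring "multiple processes on multiple
	// ports": the aggregate rate is no longer limited by one thread's
	// 20-50 microsecond per-packet cost.
	for port := 9000; port < 9008; port++ {
		conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: port})
		if err != nil {
			log.Fatal(err)
		}
		go serve(conn)
	}
	select {} // run forever
}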

references

lost-multicast-packets-troubleshooting

linux-udp-packet-drop-debug

introduction

In the mobile internet era, the demand for real-time and interactive services has surged, making long connection services essential for applications. Unlike short connections, which follow a request-response model, long connections keep a network data channel open between the application and the server for continuous, full-duplex data transmission, allowing the server to push data to users in real-time.

Long connection services must achieve low latency, high concurrency, and high stability, which can be challenging and costly if each business maintains its own service. Therefore, the unified long connection project aims to provide a comprehensive solution, offering secure, high-concurrency, low-latency, easy-to-integrate, and cost-effective long connection services for various businesses.

background

Unified Long Connection Service

The primary goal of the unified long connection service is to offer businesses a secure, high-concurrency, low-latency, easily integrable, and cost-effective long connection system. Key objectives include:

  • Supporting major Baidu app scenarios such as live streaming, messaging, PUSH, and cloud control with secure long connection capabilities.
  • Ensuring high concurrency, stability, and low latency, maintaining the system’s professionalism and advanced nature.
  • Enabling multiple business long connection reuse, reducing the cost and burden of establishing and maintaining connections.
  • Providing a straightforward integration process with clear external interfaces for quick business integration.

background

Challenges

To build a long connection service that meets business needs, the unified long connection service faces several challenges during its design, development, and maintenance. These challenges primarily fall into two categories:

Functionality Implementation

The main challenge in designing a long connection service is defining clear boundaries between the unified service and individual business integrations. Unlike dedicated services for specific businesses, the unified service must support multiple businesses sharing a single long connection. This requires accommodating various business requirements and scenarios while avoiding excessive business logic in the unified service to ensure scalability and future development.

Typical business requirements for long connection services include:

  • Establishing, maintaining, and managing connections.
  • Forwarding upstream requests.
  • Pushing downstream data.

During data transmission, the service must support different data protocols and push models depending on the business type:

  • Messaging: private messages and small group chats (500-1000 members), primarily using unicast and batch unicast push modes with varying push frequency and concurrency.
  • Live Streaming: multicast to millions of viewers with high push frequency.
  • Cloud Control: sending messages to fixed groups in batch unicast mode with low push frequency.
  • PUSH Notifications: sending messages to fixed groups in batch unicast mode with lower push frequency.
business        push scenarios             push UPS     frequency
Messaging       unicast / batch unicast    10K level    high
Live stream     group cast                 10M level    high
Cloud control   batch cast                 1M level     low
PUSH            batch cast                 1M level     low

Consequently, the unified long connection service must provide the following capabilities:

  1. Connection establishment, maintenance, and management.
  2. Upstream and downstream data forwarding, accommodating different business data protocols.
  3. Downstream push, supporting unicast, batch unicast, and broadcast.

functional goals

Performance Optimization

The unified long connection service must achieve high concurrency, high availability, and high stability to serve Baidu’s apps. Specific performance aspects include:

performance               standard      description
concurrent connections    10M level     horizontal scaling
upstream QPS              1M level      horizontal scaling
downstream QPS            10M level     horizontal scaling
latency                   10ms level

Connection QPS, latency, and success rate:

Long connections need to be established quickly when the app opens and maintained while the app is active. The service must support thousands of QPS for connection establishment and millions of concurrent online connections, with horizontal scaling capabilities. Connection establishment is fundamental, with success rate and latency being critical.

Upstream request QPS, latency, and success rate:

Once the connection is established, business requests need to be forwarded to the backend, supporting at least tens to hundreds of thousands of QPS, with horizontal scaling.

Downstream request QPS, latency, and success rate:

Depending on the business scenario, downstream requests may involve batch unicast or multicast. Generally, batch unicast should support millions of UPS, and multicast should support tens of millions of UPS, with horizontal scaling.

Overall Architecture

overall

Architecture

The long connection service consists of four main components: the Unified Long Connection SDK, the Control Layer, the Access Layer, and the Route Layer.

protocol

The vision of the unified long connection service is to support multiple businesses reusing a single long connection. This requires carrying different business data protocols over the same connection and distinguishing which business each upstream or downstream request belongs to.

This is achieved through a private long connection data protocol, which consists of three parts:

  1. Protocol Header: Includes protocol identification, protocol version, etc.
  2. Common Parameters: Device identification, application identification, business identification, request metadata, etc.
  3. Business Data: Custom business data compatible with different business data protocols.
    The long connection service only handles forwarding and does not involve specific business details.

The approximate format of the protocol is as follows (not an actual protocol):

Protocol {
    Header {
        Protocol ID (type) = value,
        Protocol Version (type) = value,
    },
    Common Parameters {
        Device ID (type) = value,
        App ID (type) = value,
        Business ID (type) = value,
        Request Metadata (type) = {
            key: value,
            key: value,
            ...
        },
    },
    Business Data {
        Custom Data 1 (type) = value,
        Custom Data 2 (type) = value,
        ...
    }
}
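
Expressed as hypothetical Go types (field names are illustrative, not the real wire format), the three-part layout could look like this; note the business payload is opaque bytes, so the service forwards it without parsing:

package protocol

// Header carries protocol identification and version.
type Header struct {
	ProtocolID uint16
	Version    uint16
}

// CommonParams are the connection-level fields every request carries,
// independent of which business the payload belongs to.
type CommonParams struct {
	DeviceID   string
	AppID      string
	BusinessID string            // which business this frame belongs to
	Metadata   map[string]string // request metadata
}

// Frame is one unit on the wire: the service routes on Header and
// CommonParams and never interprets BusinessData.
type Frame struct {
	Header       Header
	Common       CommonParams
	BusinessData []byte // opaque, business-defined encoding
}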

control layer

Control Layer’s Main Functions:

  1. Verify device legitimacy and determine access strategies before connection establishment.
  2. Generate and verify tokens for device authentication.
  3. Assign access points based on client properties.
  4. Manage small traffic control strategies.

control layer

Access layer

The main functions of the access layer include:

  1. Peer Communication: Establishes, maintains, and releases long connections with the SDK.
  2. Connection Management: Manages connections and maps connection IDs to connection information.
  3. Group Management: Manages connection groups and maps group IDs to connection information.
  4. Upstream Forwarding: Forwards business requests to the backend and returns responses to the SDK.
  5. Downstream Pushing: Receives push requests and sends them to the corresponding SDK.

connection state management

state
Long connections require validation of their legitimacy and effectiveness, as well as rapid response to anomalies. To achieve this, Unified Long Connection employs a state machine. This mechanism clearly defines the various states a long connection can assume during its lifecycle, the actions each state can trigger, and the conditions under which transitions occur.

For instance, before transmitting data, a connection must undergo login validation to ensure its legality. Once authenticated, the connection can engage in data transmission. In case of anomalies post-login, such as data format issues or network disruptions, the system triggers connection invalidation and initiates a reconnect process.

The state machine simplifies the development logic for managing long connection states, ensuring clear definitions of each state and transition condition. This approach helps prevent situations where connections cannot recover due to unknown reasons.
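
A minimal Go sketch of such a state machine (the states and events are illustrative, simplified from the lifecycle described above): transitions live in one table, so any (state, event) pair not listed is rejected instead of leaving the connection in an unknown state.

package main

import "fmt"

// State of a long connection.
type State int

const (
	Connected State = iota // transport established, not yet authenticated
	LoggedIn               // login validated, may transmit data
	Closed                 // invalidated; client should reconnect
)

// Event drives transitions.
type Event int

const (
	LoginOK Event = iota
	DataError
	NetworkBroken
	Logout
)

// transition is the whole state machine in one table: every legal
// (state, event) pair maps to a next state; everything else is illegal.
var transition = map[State]map[Event]State{
	Connected: {LoginOK: LoggedIn, NetworkBroken: Closed},
	LoggedIn:  {DataError: Closed, NetworkBroken: Closed, Logout: Closed},
}

func step(s State, e Event) (State, error) {
	next, ok := transition[s][e]
	if !ok {
		return s, fmt.Errorf("illegal event %v in state %v", e, s)
	}
	return next, nil
}

func main() {
	s := Connected
	s, _ = step(s, LoginOK)   // Connected -> LoggedIn
	s, _ = step(s, DataError) // LoggedIn -> Closed: triggers reconnect
	fmt.Println(s == Closed)  // true
}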

Multi-protocol support

Long connections are built on protocols such as TCP, TLS, QUIC, and WebSocket. Different scenarios use different protocols; for example, native app (NA) clients use TCP and TLS, while mini programs and web clients use WebSocket. To adapt to different scenarios and improve connection quality, we support multiple protocols.

  • Connection Layer: manages specific protocols (e.g., TLS, WebSocket, QUIC) and provides a unified data interface to the session layer. New protocols are adapted here without affecting session logic.
  • Session Layer: implements the long connection business logic (e.g., request forwarding, downstream pushing) and interacts only with the connection layer’s unified interface, unaware of specific protocol details. (A sketch of this split follows the advantages list below.)

Clients select protocols and access points based on their conditions (e.g., client type, network type, device quality).

multi-protocol

Advantages:

  1. Isolates business logic from protocol details, simplifying support for multiple protocols.
  2. Clients choose protocols based on conditions, improving connection quality.
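
A sketch of the split as Go interfaces (names are illustrative, not the real codebase): each protocol adapter implements Transport, and the session layer is written only against that interface.

package conn

// Transport is the unified data interface: each protocol adapter
// (TLS, WebSocket, QUIC, ...) implements it.
type Transport interface {
	ReadFrame() ([]byte, error) // one complete protocol frame
	WriteFrame(b []byte) error
	Close() error
}

// Session holds the business logic (forwarding, pushing) and is reused
// unchanged across every Transport implementation.
type Session struct {
	t Transport
}

func NewSession(t Transport) *Session { return &Session{t: t} }

// Serve pumps frames; the session never sees TLS records, WebSocket
// framing, or QUIC streams.
func (s *Session) Serve(handle func(req []byte) ([]byte, error)) error {
	defer s.t.Close()
	for {
		req, err := s.t.ReadFrame()
		if err != nil {
			return err
		}
		resp, err := handle(req)
		if err != nil {
			return err
		}
		if err := s.t.WriteFrame(resp); err != nil {
			return err
		}
	}
}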

upstream

request forward

After the access layer identifies the source of business data, it forwards this data to the business server via RPC and sends the server’s response back to the client. Along with business request data, it includes long connection public parameters. If needed, the access layer can also notify the business server in real-time about connection status changes, like disconnections, for the server to take appropriate actions.

request forward

downstream

There are mainly two downstream cast types:

  • batch unicast: pushes messages to specific devices by mapping device IDs to connection information.
  • group cast: sends the same message to multiple users by managing connection groups for efficient distribution.

unicast/batch unicast

Unicast Push: the server pushes a message to a specific device by locating its connection instance and connection ID. This requires mapping the device ID to connection information (instance IP + connection ID). The main task is to determine the device ID of the target user; in device-oriented scenarios, the business pushes directly to a device ID via the interface (a minimal sketch follows).
unicast
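
A minimal Go sketch of the unicast path (names are illustrative; the RPC to the access-layer instance is elided): the router resolves deviceID -> (instance IP, connection ID), and batch unicast is simply that lookup in a loop.

package main

import "fmt"

// ConnInfo locates one device's connection: which access-layer
// instance holds it, and the connection ID on that instance.
type ConnInfo struct {
	InstanceIP string
	ConnID     int64
}

// Router is the deviceID -> connection-info map the unicast path
// depends on; in production this would live in a shared store.
type Router struct {
	byDevice map[string]ConnInfo
}

func (r *Router) Register(deviceID string, info ConnInfo) { r.byDevice[deviceID] = info }

// Unicast resolves the device and hands the payload to the right
// access-layer instance (the actual RPC is elided).
func (r *Router) Unicast(deviceID string, payload []byte) error {
	info, ok := r.byDevice[deviceID]
	if !ok {
		return fmt.Errorf("device %s not online", deviceID)
	}
	fmt.Printf("push %d bytes to %s (conn %d)\n", len(payload), info.InstanceIP, info.ConnID)
	return nil
}

// BatchUnicast is unicast in a loop over a device list.
func (r *Router) BatchUnicast(deviceIDs []string, payload []byte) {
	for _, id := range deviceIDs {
		_ = r.Unicast(id, payload) // offline devices are skipped
	}
}

func main() {
	r := &Router{byDevice: make(map[string]ConnInfo)}
	r.Register("dev-1", ConnInfo{"10.1.2.3", 42})
	r.BatchUnicast([]string{"dev-1", "dev-2"}, []byte("notify"))
}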

group cast

Groupcast Push: Used for scenarios like live streaming, where the same message is sent to many users. The routing layer maintains connection groups, mapping group IDs to their connections. Businesses control group creation, and clients join or leave groups. Once established, the long connection service distributes messages to all connections in the group.

To use connection groups, the business needs to (see the sketch below):

  1. Create connection groups.
  2. Manage client joins and leaves.
  3. Push messages using the group ID.

group cast
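
A minimal Go sketch of that three-step flow (names are illustrative; delivery to individual connections inside the service is reduced to a print):

package main

import "fmt"

// GroupService tracks group membership as the business sees it:
// create, join/leave, then push to the whole group.
type GroupService struct {
	groups map[string]map[int64]bool // groupID -> set of connection IDs
}

func (g *GroupService) Create(groupID string) {
	if g.groups[groupID] == nil {
		g.groups[groupID] = make(map[int64]bool)
	}
}

// Join and Leave assume Create has been called for the group.
func (g *GroupService) Join(groupID string, connID int64)  { g.groups[groupID][connID] = true }
func (g *GroupService) Leave(groupID string, connID int64) { delete(g.groups[groupID], connID) }

// GroupCast fans one message out to every connection in the group.
func (g *GroupService) GroupCast(groupID string, payload []byte) {
	for connID := range g.groups[groupID] {
		fmt.Printf("conn %d <- %d bytes\n", connID, len(payload))
	}
}

func main() {
	svc := &GroupService{groups: make(map[string]map[int64]bool)}
	svc.Create("live-room-1")
	svc.Join("live-room-1", 101)
	svc.Join("live-room-1", 102)
	svc.GroupCast("live-room-1", []byte("a viewer joined"))
}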

summary

The unified long connection system now supports tens of millions of concurrent connections and handles millions of UPS of batch unicast and multicast messages, with real-time scaling. Since launch, it has remained stable, successfully handling high-concurrency events without impacting other services. Overall, the project has met its service quality expectations.

Key Insights:

  1. Requirements Analysis: Clearly define boundaries between requirements and business logic to maintain service stability.
  2. Technical Design: Opt for simple solutions that meet clear requirements. Focus on stability and high performance rather than complex solutions.
  3. Operations: Balance single-instance performance with maintenance needs. Multiple smaller instances often offer better stability and resource efficiency.

references

千万级高性能长连接Go服务架构实践 (Architecture and Practice of a High-Performance Go Long Connection Service at the Tens-of-Millions Scale) by glstr

Many thanks to the original author, glstr
