In the digital age, data has become one of the most valuable assets for businesses. As a leading lifestyle service platform in China, 58 Group has been continuously exploring and innovating in the construction of its data integration platform. This article will detail the architectural evolution, optimization strategies, and future plans of 58 Group's data integration platform based on Apache SeaTunnel.
58 Group has a wide range of businesses, and with the rapid development of these businesses, the scale of data from various business areas such as recruitment, real estate, second-hand housing, second-hand markets, local services, and information security has increased significantly.
58 Group needs to facilitate the flow and convergence of data between different data sources to achieve unified management, circulation, and sharing of data. This involves not only the collection, distribution, and storage of data but also applications such as offline computing, cross-cluster synchronization, and user profiling.
Currently, 58 Group processes over 500 billion messages daily, with peak message processing reaching over 20 million, and the number of tasks reaching over 1600. Handling such a massive volume of data presents significant challenges.
In facilitating the flow and convergence of data between different sources and achieving unified management, circulation, and sharing of data, 58 Group faces challenges including:
The architecture of 58 Group's data integration platform has undergone multiple evolutions to adapt to changing business needs and technological developments.
From 2017 to 2018, 58 Group's data integration platform adopted the Kafka Connect architecture for Kafka-based data integration. Kafka Connect scales horizontally through distributed processing, running Workers and Tasks across multiple nodes; when a Worker fails, its tasks are automatically redistributed to other Workers, which provides high availability. It also supports automated offset management, as well as task and configuration management through a REST API.
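To make the REST-based task and configuration management concrete, here is a minimal sketch of how requests to a Kafka Connect worker could be built. The endpoints `PUT /connectors/{name}/config` and `GET /connectors/{name}/status` belong to Kafka Connect's REST interface; the worker address, connector name, topic, and file path are placeholders assumed for illustration and are not taken from 58 Group's setup.

```python
import json
import urllib.request

# Assumed address of a Kafka Connect worker (8083 is the default REST port).
CONNECT_URL = "http://localhost:8083"


def put_connector_config(name: str, config: dict) -> urllib.request.Request:
    """Build a PUT /connectors/{name}/config request (create or update a connector)."""
    return urllib.request.Request(
        url=f"{CONNECT_URL}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def get_connector_status(name: str) -> urllib.request.Request:
    """Build a GET /connectors/{name}/status request (connector and task states)."""
    return urllib.request.Request(
        url=f"{CONNECT_URL}/connectors/{name}/status", method="GET"
    )


# Hypothetical connector: name, topic, and file path are placeholders.
example_config = {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "2",
    "topics": "demo-topic",
    "file": "/tmp/demo-sink.txt",
}

request = put_connector_config("demo-file-sink", example_config)
status_request = get_connector_status("demo-file-sink")
# To send a request to a running worker: urllib.request.urlopen(request)
```

The sketch only constructs the requests, so it runs without a cluster; sending them requires a reachable Connect worker.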
However, with the expansion of business volume and diversification of scenarios, this architecture encountered bottlenecks:
Inability to achieve end-to-end data integration.
Heartbeat Timeout: Worker-to-coordinator heartbeat timeouts trigger task rebalancing, causing temporary task interruptions.
Heartbeat Pressure: Workers synchronize with the coordinator, which must track the state of every worker and manage a large amount of task metadata.
Coordinator Failure: Coordinator downtime affects task allocation and reallocation, causing task failures and decreased processing efficiency.
Task Pause and Resume: Each rebalance pauses tasks, then reallocates them, leading to brief task interruptions.
Rebalance Storms: If multiple worker nodes frequently join or leave the cluster, or if network jitter causes heartbeat timeouts, the resulting frequent rebalances can significantly reduce task processing efficiency and cause delays.
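The interaction between heartbeat timeouts and rebalancing described above can be illustrated with a toy model. This is a simplified sketch, not Kafka Connect's actual protocol: the `Coordinator` class, its method names, and the round-robin assignment are assumptions made for illustration. Every membership change triggers a rebalance in which all tasks are revoked and reassigned.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Toy coordinator: tracks heartbeats and reassigns tasks on membership change."""
    tasks: list
    session_timeout: float
    last_heartbeat: dict = field(default_factory=dict)
    assignment: dict = field(default_factory=dict)
    rebalances: int = 0

    def heartbeat(self, worker: str, now: float) -> None:
        is_new = worker not in self.last_heartbeat
        self.last_heartbeat[worker] = now
        if is_new:  # a join changes group membership
            self._rebalance()

    def check_timeouts(self, now: float) -> None:
        expired = [w for w, t in self.last_heartbeat.items()
                   if now - t > self.session_timeout]
        for worker in expired:
            del self.last_heartbeat[worker]
        if expired:  # a departure changes group membership
            self._rebalance()

    def _rebalance(self) -> None:
        # Every task is revoked (paused) and reassigned round-robin.
        self.rebalances += 1
        workers = sorted(self.last_heartbeat)
        self.assignment = {w: [] for w in workers}
        for i, task in enumerate(self.tasks):
            if workers:
                self.assignment[workers[i % len(workers)]].append(task)


coordinator = Coordinator(tasks=[f"task-{i}" for i in range(6)], session_timeout=10.0)
coordinator.heartbeat("worker-a", now=0.0)   # join triggers a rebalance
coordinator.heartbeat("worker-b", now=0.0)   # join triggers a rebalance
coordinator.heartbeat("worker-a", now=8.0)   # regular heartbeat, no rebalance
coordinator.check_timeouts(now=12.0)         # worker-b's heartbeat is late: rebalance
coordinator.heartbeat("worker-b", now=13.0)  # worker-b rejoins: rebalance again
print(coordinator.rebalances, coordinator.assignment)
```

In this run a single delayed heartbeat costs two extra rebalances, one when the worker is dropped and one when it rejoins, and each of them pauses every task. That is the mechanism behind a rebalance storm when jitter affects many workers.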
Given these shortcomings, 58 Group introduced Apache SeaTunnel in 2023 and integrated it into the real-time computing platform, so that Sources and Sinks can be extended freely.
Currently, 58 Group's data integration platform is built on the Apache SeaTunnel engine. It reads from Sources (Kafka, Pulsar, WMB, Hive, etc.), processes the data with SeaTunnel's built-in Transform features, and writes it to Sinks (Hive, HDFS, Kafka, Pulsar, WMB, MySQL, SR, Redis, HBase, Wtable, MongoDB, etc.), while providing efficient task management, status management, task monitoring, intelligent diagnostics, and more.
When introducing Apache SeaTunnel, 58 Group needed to migrate the data integration platform smoothly, minimizing the impact on users and businesses while ensuring data consistency: the same data format, the same paths, and no data loss.
This goal brought challenges around the cost and risk of migration: the data source format of every task had to be understood and confirmed, and the migration itself involved multiple steps, making it complex and time-consuming.
To address this, 58 Group took the following measures:
For sources, add RawDeserializationSchema to stay compatible with unstructured data.
For destinations, stay compatible with existing partition loading and paths, for example by using the HDFS sink to write to Hive.
Develop automatic migration tools:
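As an illustration of the source and destination compatibility measures listed above (not of the migration tools themselves), the following sketch shows the two ideas in miniature. It is a conceptual Python model, not SeaTunnel's Java implementation: the function names, the `content` field name, and the `dt=YYYYMMDD/hour=HH` partition layout are all assumptions made for this example.

```python
from datetime import datetime, timezone


def raw_deserialize(message: bytes) -> dict:
    """Wrap the payload in a single-field row without parsing it.

    Mirrors the idea of a raw deserialization schema: unstructured data
    (plain text, malformed JSON, binary) passes through untouched.
    The field name "content" is an assumption made for this sketch.
    """
    return {"content": message}


def hive_partition_path(table_root: str, event_time: datetime) -> str:
    """Build the HDFS directory for one Hive partition.

    Assumes a hypothetical dt=YYYYMMDD/hour=HH layout; a real migration
    would reuse whatever layout the existing Hive table already has.
    """
    return (f"{table_root.rstrip('/')}"
            f"/dt={event_time:%Y%m%d}/hour={event_time:%H}")


# A payload that a JSON parser would reject still flows through unchanged.
row = raw_deserialize(b'{"broken json')

# Hypothetical table location; the path must match what Hive already loads.
path = hive_partition_path(
    "/warehouse/demo.db/events/",
    datetime(2024, 1, 2, 3, 4, tzinfo=timezone.utc),
)
print(row, path)
```

Keeping the payload opaque at the source and reproducing the existing directory layout at the sink is what would let downstream Hive jobs keep reading migrated data without changes, which matches the format and path consistency goals stated earlier.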
58 Group also carried out several performance optimizations on the data integration platform, including:
Additionally, 58 Group improved the stability and efficiency of the data integration platform through monitoring and operations automation:
58 Group has clear plans for the future development of the data integration platform:
The architectural evolution and optimization of 58 Group's data integration platform is a continuous process of iteration and innovation. Through ongoing technological exploration and practice, 58 Group has built an efficient, stable, and scalable data integration platform based on Apache SeaTunnel, providing strong data support for business development.
In the future, 58 Group will continue to delve deeper into the field of data integration to provide better services for users.