Axel Springer, Europe's largest publishing group, known for brands such as Bild, Welt, Business Insider, and Politico, has probably crossed everyone's path. In Axel Springer's case we can genuinely speak of Big Data: over a petabyte of data is processed daily. National Media and Tech (NMT) at Axel Springer SE is responsible for all topics relating to Tech & Product; within it, the Data Section is responsible for all topics relating to data products.
Axel Springer has been migrating to Palantir's Big Data platform Foundry for four years. After an initial phase in which the first teams brought their data pipelines onto the platform, it is now migrating all existing data products and building all new ones directly on the platform.
To support colleagues during the migration and to ensure that a unified, cross-project structure is created and maintained, there is the One Data Platform team, of which kreuzwerker was also a part.
One Big Data platform for different users
Foundry is a Big Data platform that describes itself as follows:
Foundry is a highly available, continuously updated, fully managed SaaS platform that spans from cloud hosting and data integration to flexible analytics, visualization, model-building, operational decision-making, and decision capture.
The platform thus takes over all infrastructural tasks and lets the data engineers, data scientists, and data analysts who work with it concentrate fully on tasks directly related to the data. However, this does not mean that Foundry takes care of everything: while Foundry controls where data is stored (including backups, retention time, etc.), NMT staff must decide how it is stored (what the schema is, whether the data is partitioned). Foundry offers different applications for aggregating data, some requiring no code at all, others where you only write the ETL transformations. For maximum flexibility, one can also create repositories and write the whole pipeline and its tests in code. These different applications allow users with a variety of programming skills to work together on data projects. There are also applications for data health checks and visualization, and even ones that let you build your own applications on top of your own data.

An important technology used here is Apache Spark, a framework for distributed computing: data is stored and processed in a distributed manner, allowing a high degree of parallelization. Each data pipeline can thus scale to more machines as the amount of data grows (horizontal scaling).
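The map-and-merge idea behind this horizontal scaling can be illustrated with a small, self-contained sketch in plain Python. This is purely conceptual, not Foundry or Spark code, and the brand/event data is invented for illustration:

```python
from functools import reduce

def partition(rows, n_partitions):
    """Split rows into n roughly equal chunks, as Spark splits a dataset."""
    return [rows[i::n_partitions] for i in range(n_partitions)]

def map_partition(rows):
    """Per-partition work: here, count events per brand. In Spark each
    partition could be processed on a different machine; adding machines
    lets more partitions run at once."""
    counts = {}
    for row in rows:
        counts[row["brand"]] = counts.get(row["brand"], 0) + 1
    return counts

def merge(a, b):
    """Combine two partial results (the 'reduce' step)."""
    for key, value in b.items():
        a[key] = a.get(key, 0) + value
    return a

events = [{"brand": "Bild"}, {"brand": "Welt"}, {"brand": "Bild"}]
partials = [map_partition(p) for p in partition(events, 2)]
totals = reduce(merge, partials, {})
print(totals)  # {'Bild': 2, 'Welt': 1}
```

Because each partition is processed independently, the per-partition step parallelizes freely; only the cheap merge step combines results.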
A common platform
Giving Axel Springer's data teams access to the platform and asking them to migrate to it and develop their new products there is a start, but on its own it quickly leads to chaos. That is why there is the One Data Platform team (ODP), which supports the different teams in this process, collects the knowledge of the individual teams, puts it into a more general context, and distributes it to everyone. The goal is for the teams to produce optimal, efficient data products that fit together across the Data Section, whether because data is also used by other teams, or because shared standards let teams help each other and benefit from each other's best practices.
Rather than pre-defining a big migration plan and building a rigid set of rules (which risks never being implemented that way), kreuzwerker and the ODP went to the different teams and helped them migrate or build their data and ETL pipelines, ML models, reporting, and applications on Foundry. With this approach, employees were trained in Foundry, Spark, and ETL design, while the ODP team gained broad, hands-on experience with the platform.
Individual coaching for a sustainably good code base
In addition to joint code reviews and knowledge-sharing sessions, kreuzwerker focused on coaching: kreuzwerker joined different teams and completed tasks together with Axel Springer employees. Since Axel Springer has interdisciplinary teams for its different data products, the level of knowledge within these teams often varies widely. Sometimes it was a matter of teaching engineering best practices, such as readable, tested code and health checks for data. This was valuable not only for the employees but also for the ODP team, which cares about how projects on the data platform should be structured so that even newcomers can quickly find their way around.
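What such a data health check amounts to can be sketched in plain Python (this is an illustrative stand-in, not Foundry's built-in health-check application; the column names and the 5% threshold are invented assumptions):

```python
# Hypothetical required schema and tolerance for a published dataset.
REQUIRED_COLUMNS = {"article_id", "brand", "published_at"}
MAX_NULL_RATE = 0.05  # tolerate at most 5% missing values per column

def check_health(rows):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if not rows:
        return ["dataset is empty"]
    missing = REQUIRED_COLUMNS - set(rows[0])
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    for column in REQUIRED_COLUMNS & set(rows[0]):
        null_rate = sum(r[column] is None for r in rows) / len(rows)
        if null_rate > MAX_NULL_RATE:
            problems.append(f"{column}: {null_rate:.0%} nulls")
    return problems

rows = [
    {"article_id": 1, "brand": "Welt", "published_at": "2023-01-05"},
    {"article_id": 2, "brand": None, "published_at": "2023-01-06"},
]
print(check_health(rows))  # ['brand: 50% nulls']
```

Running such checks before a dataset is published keeps bad data from silently propagating into downstream pipelines.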
In other cases, it was a matter of optimizing computationally intensive pipelines. Here, kreuzwerker analyzed the pipelines, eliminated redundant steps, and made sure data was stored and read as efficiently as possible; Spark offers various storage options for this, such as partitioning and bucketing. Furthermore, the ODP team and kreuzwerker made sure that the available resources were used optimally and analyzed the pipelines based on usage metrics.
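The idea behind bucketing can be shown with a minimal plain-Python sketch (again conceptual, not actual Spark code; the datasets and the four-bucket layout are invented): rows are assigned to a fixed number of buckets by hashing a key, so two datasets bucketed the same way on the same key can be joined bucket by bucket without a full shuffle.

```python
N_BUCKETS = 4

def bucket_of(key):
    """Deterministic bucket assignment, analogous to hash(key) % numBuckets."""
    return sum(key.encode()) % N_BUCKETS  # stable stand-in for a real hash

def write_bucketed(rows, key_column):
    """Group rows into buckets, as bucketed files on disk would be."""
    buckets = {b: [] for b in range(N_BUCKETS)}
    for row in rows:
        buckets[bucket_of(row[key_column])].append(row)
    return buckets

articles = [{"brand": "Bild", "title": "A"}, {"brand": "Welt", "title": "B"}]
clicks = [{"brand": "Bild", "clicks": 10}]

left = write_bucketed(articles, "brand")
right = write_bucketed(clicks, "brand")

# A bucket-local join only compares rows within matching buckets:
joined = [
    (a, c)
    for b in range(N_BUCKETS)
    for a in left[b]
    for c in right[b]
    if a["brand"] == c["brand"]
]
print(len(joined))  # 1
```

Because matching keys always land in the same bucket, the join never has to compare rows across buckets, which is what saves the expensive shuffle in a real distributed engine.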
After a period in which the ODP, together with kreuzwerker, accompanied many teams, implemented data products, and gathered and discussed experiences, common project structures were defined for the entire Data Section, so that each new project no longer had to be thought through from scratch. Another important pillar of the platform was building a data catalog, as many teams share common data sources, and the data produced by one team often serves as input for other projects.
When kreuzwerker started working with Axel Springer, a platform existed and was already being used by some teams. Through the ODP team, to which kreuzwerker contributed, Foundry became not merely a tool used for computations, but the data platform for Axel Springer's entire Data Section.