How we develop DMP for Taxi, Food, and Lavka

RU / Day 3 / 17:15 / Track 3

The Yandex.Go team is developing a Data Management Platform (DMP) as a service for offline and near-real-time data processing in Taxi, Food, and Lavka.

Vladimir will talk about what motivates building your own ETL tool and about turning ETL and a DWH into a DMP. He will describe the problems that arise while developing a DMP and share the team's experience of solving them.

Currently at Yandex we have:

  • more than a thousand data transformation processes that are launched hundreds of thousands of times a day;
  • a Data Lake on YT (an in-house analog of Hadoop), over 1 PB in size and growing by 100 TB per month;
  • a Data Warehouse on Greenplum with 0.5 PB of effective space;
  • Tableau, OLAP cubes in MS SSAS, and analytical tools on JupyterHub.

Platform users: 4 teams of data engineers, several teams of data analysts, and backend developers. They prepare data for analytics, management and operational reporting, ML, and use in runtime applications.

The structure of the talk:

  • a little context — storage, stack, and work pattern;
  • the ETL framework: its internals and features;
  • the life of a data engineer, analyst, and backend developer on the Yandex platform;
  • the internals of individual tools and parts of the platform.