Advanced ETL Techniques for Beginners | by ????Mike Shakhomirov | Feb, 2024

On a scale from 1 to 10 how good are your data ingestion skills?

????Mike Shakhomirov
Towards Data Science
Photo by Blake Connally on Unsplash

Data ingestion is a crucial step in data engineering. Data engineers load huge amounts of data into various database systems for further transformation and processing. While dealing with relatively small amounts of data on staging we are in luck not running out of memory, working on production data pipelines with terabytes (or even petabytes) of records often turns into a real challenge. Existing ETL solutions offer automated data loading into a data warehouse we need and often have row-based pricing models. In this story, I would like to discuss how to create a bespoke data-loading solution for our pipelines to enable efficient data loading. We will take a better look into common data ingestion design patterns and typical ways to organise the process. We will reverse-engineer some of the most popular ETL solutions to see how data can be ingested without outages and losses efficiently. I will provide data-loading examples using Python libraries and tools available in the market for free to summarise my findings.

On a scale from 1 to 10 how good are your data loading skills? –

That would be one of my favourite questions during data engineering interviews. I keep looking for talents who know how to build bespoke ETL systems.

Indeed, being able to create a robust data loading system that can process data efficiently, doesn’t fail, doesn’t consume too much memory, can handle various data formats and scales well — this is what marks an experienced data engineer in my opinion. With the abundance of tools available in the market for ETL tasks, we are in luck and don’t really need this. Until the company decides to build this in-house. There might be various reasons for that and one of the obvious ones is security and regulations. Dealing with sensitive data is always challenging and often data must not leave certain regions and/or geographical locations. Another good reason to develop ETL expertise internally is that it saves tons of money in the long run. Having an all-hands software engineer who is experienced with data platform design and knows many ETL tools and frameworks is always great. Companies are hunting for those talents. I…

Source link

Technology

gaitQ and machineMD secure million dollar research grant to monitor Parkinson’s development in UK and Switzerland

Oxford-based medical technology start-up gaitQ and Swiss medical device company machineMD have announced the joint award of a million dollar research grant from Innovate UK and Innosuisse to enable the collection and analysis of critical movement data from people with Parkinson’s (PwP). The grant will fund an 18-month research project that will record movement data […]

Read More
Technology

Take-Two plans to lay off 5 percent of its employees by the end of 2024

Take-Two Interactive plans to lay off 5 percent of its workforce, or about 600 employees, by the end of the year, as reported in an SEC filing Tuesday. The studio is also canceling several in-development projects. These moves are expected to cost $160 million to $200 million to implement, and should result in $165 million […]

Read More
Technology

10 tips to avoid planting AI timebombs in your organization

At the recent HIMSS Global Health Conference & Exhibition in Orlando, I delivered a talk focused on protecting against some of the pitfalls of artificial intelligence in healthcare. The objective was to encourage healthcare professionals to think deeply about the realities of AI transformation, while providing them with real-world examples of how to proceed safely […]

Read More