In the field of data, two key roles stand out: Data Engineer and Data Scientist. These two roles are often confused because they belong to the same field and work closely together. In some companies, both are filled by the same person in a cross-functional role. However, they perform distinct, albeit complementary, tasks.
We'll see how these two roles together enable a complete data chain: collecting, preparing, and analyzing data, then industrializing and monitoring a model.
Two professions, complementary talents
The role of the Data Engineer: builder of data infrastructure
The Data Engineer is responsible for creating and maintaining pipelines that ensure the reliable processing of data flows. A pipeline is a series of automated steps that:
- Ingest, transform, and cleanse data. ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are among the most common approaches for organizing these processes.
- Store and organize data in appropriate environments (data lake, data warehouse, databases).
- Structure and prepare data for training and deploying a model.
- Feed the processes needed to run the model in production.
These pipelines ensure the reproducibility and scalability of processing when a model moves from a proof of concept (POC) to production.
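To make the ETL idea concrete, here is a minimal sketch using only the Python standard library. The sample data, the column names, and the `sales` table are assumptions made for illustration; a real pipeline would read from actual sources and write to a data warehouse rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """customer_id,country,amount
1,fr,10.5
2,FR,
3,de,7.0
3,de,7.0
"""

def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: drop rows without an amount, normalize country, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # cleansing: skip rows with a missing amount
        record = (int(row["customer_id"]), row["country"].upper(), float(row["amount"]))
        if record in seen:
            continue  # cleansing: skip exact duplicates
        seen.add(record)
        cleaned.append(record)
    return cleaned

def load(records, connection):
    """Load: write the cleaned records into a table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id INTEGER, country TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    connection.commit()

def run_pipeline(raw_text, connection):
    """Run the three steps in order and return the number of rows loaded."""
    records = transform(extract(raw_text))
    load(records, connection)
    return len(records)
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` runs the three steps end to end. Because each step is a plain function, the same code can be rerun on new data, which is what makes the processing reproducible.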
The Data Engineer must also:
- Optimize performance by parallelizing processing to reduce execution times and costs, particularly in cloud environments.
- Maintain data quality, reliability, and availability by implementing monitoring and disaster recovery mechanisms.
- Manage the architecture of pipelines and data systems so that it is tailored to requirements and functional in an operational context.
The work of the Data Engineer is essential for the Data Scientist: without it, accessing, cleaning, and preprocessing data would be very time-consuming, and putting models into production would be much more complex.
The role of the Data Scientist: explorer & revealer of meaning
Data scientists explore data in search of insights and trends that could give meaning to the data and extract value from it.
By combining skills in statistics and machine learning (and, more recently, generative AI), they build models from large volumes of data to meet the needs of businesses.
To do so, they carry out the following actions:
- Data preparation and exploration.
- Model training: iterative experiments with different types of models, hyperparameters, and variables.
- Performance evaluation: testing on validation data to select the best model, based on metrics.
Once the model has been trained and put into production, the results it produces must be interpretable and usable by the business. The model's output must therefore be meaningful for the use case it addresses.
Without Data Scientists, the work of Data Engineers would lose some of its meaning, as the prepared data would not be used to create value.
Overlaps and gray areas
Some of the tasks performed by data engineers and data scientists overlap. For example, as mentioned above, one of their shared responsibilities is data preprocessing (cleaning, transformation).
In practice, the Data Scientist defines the relevant transformations and features, while the Data Engineer implements and optimizes them in automated, scalable pipelines.
They must therefore communicate extensively to ensure the quality of the data that will be used by the model.
The end-to-end journey: from raw data to operational model
The transition from raw data to a production model is based on key steps, carried out jointly by the Data Engineer and the Data Scientist.
Data collection & continuous ingestion
The first step in a data/AI project is data ingestion, often via connectors that enable data to be collected from various sources.
Data ingestion can be performed in batches, by streaming (real time), or with a combination of both.
The connector, for example an API, will:
- Authenticate with the data source
- Query the source at regular intervals (batch) or listen continuously (streaming)
- Transmit data to the pipeline or storage space (data lake, data warehouse, etc.)
The implementation of data ingestion pipelines is the responsibility of the Data Engineer. The Data Scientist may intervene to specify the data to be retrieved.
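The three connector steps (authenticate, query at regular intervals, transmit) can be sketched as follows. Everything here is hypothetical: `FakeSource` stands in for a remote system so that no real API is called, and the "sink" is any callable that receives the new records.

```python
import time

class FakeSource:
    """Stand-in for a remote data source (hypothetical; no real API is called)."""

    def __init__(self, valid_token, records):
        self._valid_token = valid_token
        self._records = list(records)

    def fetch_since(self, token, cursor):
        """Return the records after position `cursor`, if the token is valid."""
        if token != self._valid_token:
            raise PermissionError("authentication failed")
        return self._records[cursor:]

class BatchConnector:
    """Polls a source at a fixed interval and forwards new records to a sink."""

    def __init__(self, source, token, sink, interval_seconds=0.0):
        self.source = source
        self.token = token
        self.sink = sink  # any callable taking a list of records
        self.interval_seconds = interval_seconds
        self.cursor = 0  # position of the next unread record

    def poll_once(self):
        """Query the source once and transmit whatever is new."""
        new_records = self.source.fetch_since(self.token, self.cursor)
        if new_records:
            self.sink(new_records)
            self.cursor += len(new_records)
        return len(new_records)

    def run(self, iterations):
        """Poll `iterations` times, sleeping between polls (batch mode)."""
        total = 0
        for _ in range(iterations):
            total += self.poll_once()
            time.sleep(self.interval_seconds)
        return total
```

The cursor is what prevents the same records from being transmitted twice between two polls. A streaming connector would replace the polling loop with a continuous listener.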
Preparation & processing
The data preparation and transformation phase is first carried out by the Data Scientist, then industrialized by the Data Engineer within a pipeline.
It can account for up to 80% of a data scientist's working time. During this phase, the data scientist must:
- Standardize variables (centering and scaling) to bring them to the same scale.
- Manage missing values (deletion or imputation) and identify duplicates.
- Clean up data, especially text data (removal of special characters, extra spaces, standardization).
- Detect and handle outliers.
- Encode categorical variables to make them usable by algorithms (one-hot encoding, ordinal encoding, etc.).
- Create new variables from existing data or external sources: this is known as feature engineering.
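A few of these preparation steps can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, mean imputation is just one possible strategy, and in practice libraries such as Pandas or Scikit-learn would usually handle this work.

```python
from statistics import mean, pstdev

def impute_missing(values):
    """Replace None with the mean of the observed values (simple imputation)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Center on the mean and scale by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

def drop_duplicates(rows):
    """Keep the first occurrence of each row (a dict), preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def one_hot(values):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```

For example, `one_hot(["red", "blue", "red"])` produces one column per category in alphabetical order, so `"red"` becomes `[0, 1]` and `"blue"` becomes `[1, 0]`.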
Modeling & experimentation
This step is specific to the Data Scientist. Training a model is a highly iterative process: the data scientist tests different types of models, hyperparameters, and variables on a training dataset, often using techniques such as cross-validation to detect overfitting and obtain more reliable performance estimates.
The data scientist then evaluates the models on validation data, and then on a final test set, using metrics suited to the task. This makes it possible to compare the models and select the one that best meets the business need.
The chosen model is not necessarily the best-performing one: interpretability, robustness, or operational constraints may also influence the choice.
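As an illustration of comparing models with cross-validation, here is a minimal sketch in plain Python. The two candidate models (a constant baseline and a least-squares line), the mean squared error metric, and the choice of three folds are all assumptions made for the example. It covers only the validation stage, not the final test set.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    constant = mean(ys)
    return lambda x: constant

def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept fitted on the training data."""
    x_bar, y_bar = mean(xs), mean(ys)
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / variance
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept

def cross_validated_mse(fit, xs, ys, k=3):
    """Average mean squared error over the k validation folds."""
    scores = []
    for train, validation in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in validation]
        scores.append(mean(errors))
    return mean(scores)

def select_model(candidates, xs, ys, k=3):
    """Return the name of the candidate with the lowest error, plus all scores."""
    scores = {name: cross_validated_mse(fit, xs, ys, k) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores
```

Calling `select_model({"mean": fit_mean, "line": fit_line}, xs, ys)` returns the best candidate and the score of each one. Keeping every score, not just the winner, leaves room to prefer a simpler or more interpretable model when the gap is small.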
Deployment & production launch
At this stage, the Data Engineer takes the lead. They choose a suitable infrastructure (on-premise server or cloud environment) and the most appropriate inference mode for the use case: batch execution via scheduled jobs, or real-time inference via an API.
To launch the necessary jobs, they use orchestrators such as Airflow or Luigi. These tools schedule, chain, and execute tasks that may depend on each other.
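The core idea of an orchestrator, running tasks in an order that respects their dependencies, can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not the Airflow or Luigi API: the task names and the `run_workflow` helper are hypothetical, and a real orchestrator adds scheduling, retries, and logging on top.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, dependencies):
    """Run each task after the tasks it depends on; return the execution order.

    tasks: dict mapping a task name to a zero-argument callable.
    dependencies: dict mapping a task name to the set of task names it depends on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order

# Hypothetical batch-inference workflow.
log = []
tasks = {
    "ingest": lambda: log.append("ingest"),
    "preprocess": lambda: log.append("preprocess"),
    "predict": lambda: log.append("predict"),
    "publish": lambda: log.append("publish"),
}
dependencies = {
    "ingest": set(),
    "preprocess": {"ingest"},
    "predict": {"preprocess"},
    "publish": {"predict"},
}
execution_order = run_workflow(tasks, dependencies)
```

If the dependencies contain a cycle, `static_order()` raises `graphlib.CycleError`, so an impossible workflow is rejected before any task runs.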
The Data Scientist provides the Data Engineer with the correct version of the model to be deployed, the data preprocessing scripts, and the technical dependencies (versions of Python libraries).
The Data Engineer is then responsible for packaging the model in a reproducible environment (often using a Docker container). Finally, the model is deployed, usually via a CI/CD pipeline.
Monitoring and feedback loops
Monitoring is a key responsibility of the Data Scientist: it involves regularly checking that there is no drift in the model's predictions and input data. Models tend to become less effective over time. Data scientists must therefore set thresholds and define actions to be taken if they are exceeded, such as retraining the model.
The Data Engineer sets up the infrastructure that enables this monitoring. This infrastructure stores predictions and labels so that metrics can be calculated. The Data Engineer can also implement dashboards and automate model retraining via pipelines.
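One way to turn these thresholds into code is sketched below. It uses the Population Stability Index (PSI) to measure drift on one input feature and accuracy to measure prediction quality. The choice of PSI, the equal-width binning, and the threshold values (0.2 and 0.8) are illustrative assumptions, not recommendations from this article.

```python
import math

def population_stability_index(reference, current, n_bins=5):
    """PSI between two samples, using equal-width bins over the reference range."""
    low, high = min(reference), max(reference)
    width = (high - low) / n_bins
    if width == 0:
        width = 1.0  # constant reference: fall back to a single effective bin

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            index = int((v - low) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = bin_shares(reference), bin_shares(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

def accuracy(predictions, labels):
    """Share of stored predictions that match the stored labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def monitoring_decision(reference_feature, current_feature, predictions, labels,
                        psi_threshold=0.2, accuracy_threshold=0.8):
    """Return 'retrain' when either threshold is crossed, else 'ok'."""
    psi = population_stability_index(reference_feature, current_feature)
    acc = accuracy(predictions, labels)
    if psi > psi_threshold or acc < accuracy_threshold:
        return "retrain"
    return "ok"
```

In this sketch the decision is a simple string. In a real setup it would typically trigger an alert or a retraining pipeline.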
A feedback loop can also be implemented, so that the model can correct its errors based on feedback from the business.
Challenges and tensions between engineering and data science
Differences in pace & expectations
Data scientists generally work in an exploratory manner, spending a lot of time iterating and experimenting. They test hypotheses, compare models, and focus on the quality and relevance of predictions.
Data engineers, on the other hand, work at a more project-oriented pace, with deadlines, architectural constraints, and production requirements. They prioritize stability, performance, and robustness in the technical solutions they implement.
Data quality issues & scientific frustrations
If data pipelines are not sufficiently robust, data sets quickly become incomplete, inconsistent, or poorly formatted. The models that depend on them deteriorate, creating real scientific frustration: instead of experimenting and modeling, data scientists must devote a large part of their time to diagnosing and correcting these quality issues.
Missing values, incorrectly filled fields, or inconsistent formats can also generate operational risks once in production. The pipelines implemented by the Data Engineer are therefore essential to ensure the reliability of data upstream.
Over-engineering vs. under-optimization
Data engineers face the following two challenges:
- Over-engineering: too much technical optimization complicates pipelines, makes maintenance more difficult, and lengthens delivery times.
- Under-optimization: conversely, a lack of engineering leads to fragile pipelines, unreliable data, and limited scalability.
They must therefore find a compromise by building pipelines that are robust but simple, so as not to penalize the work of data scientists and to enable business teams to quickly obtain reliable results.
Responsibility in case of model error and drift
Responsibility for an error depends on where the problem originates:
- Data Engineer: if the error stems from a faulty pipeline or incorrect data.
- Data Scientist: if the error results from the model's predictions or from a poor choice of model.
In all cases, correcting the problem requires close collaboration between the Data Engineer and the Data Scientist to ensure the reliability of the system.
Best practices for bringing together two disciplines
Define a service agreement/SLA between engineering and data science
To align practices between Data Engineering and Data Science, it is useful to define a data service agreement or SLA (Service Level Agreement).
This contract formalizes mutual expectations regarding:
- Data quality: acceptable rate of missing values, management of duplicates, and content validation rules (for example, a mandatory field must not be empty, or a date must be valid).
- Data format: available columns, variable types.
- Update frequency: daily, hourly, or on demand.
It also specifies the availability of pipelines, the monitoring and alerting system, and the communication process (for example, using a RACI matrix for anomaly corrections or change planning).
A well-defined SLA reduces gray areas, limits ambiguities, and significantly improves coordination between the two professions.
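Part of such an agreement can be expressed as an automated check. The sketch below is one possible form: the `CONTRACT` structure, the column names, and the 10% missing-value limit are hypothetical values chosen for the example.

```python
from datetime import date

# Hypothetical contract: column names, types, and thresholds are illustrative.
CONTRACT = {
    "columns": {"customer_id": int, "signup_date": str, "country": str},
    "mandatory": ["customer_id", "signup_date"],
    "max_missing_rate": 0.10,
}

def is_valid_date(text):
    """True if `text` is an ISO date such as 2024-01-31."""
    try:
        date.fromisoformat(text)
        return True
    except (TypeError, ValueError):
        return False

def check_contract(rows, contract):
    """Return a list of human-readable violations (empty if the data complies)."""
    violations = []
    for column, expected_type in contract["columns"].items():
        values = [row.get(column) for row in rows]
        missing = sum(1 for v in values if v is None or v == "")
        missing_rate = missing / len(rows)
        if column in contract["mandatory"] and missing:
            violations.append(f"{column}: mandatory field is empty in {missing} row(s)")
        elif missing_rate > contract["max_missing_rate"]:
            violations.append(f"{column}: missing rate {missing_rate:.0%} exceeds the limit")
        if any(v not in (None, "") and not isinstance(v, expected_type) for v in values):
            violations.append(f"{column}: unexpected type")
    for row in rows:
        value = row.get("signup_date")
        if value and not is_valid_date(value):
            violations.append(f"signup_date: invalid date {value!r}")
    return violations
```

Running `check_contract(rows, CONTRACT)` at the end of a pipeline gives both teams the same objective answer to the question "is this dataset deliverable?".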
Pool tools and documentation
For a Data Scientist to understand the origin of the data, as well as its transformations and the pipelines built by the Data Engineer, it is essential to put shared tools and documentation in place, for example via Confluence.
These tools include, in particular:
- A data catalog, which lists all available datasets, with their metadata, owners, and descriptions.
- A data lineage system (Atlan, Collibra, Amundsen), which traces the complete path of data through pipelines.
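To show what a catalog entry and a lineage trace contain, here is a toy sketch. It does not reflect the API of any of the tools named above: the `DatasetEntry` class, the dataset names, and the owners are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """One catalog entry: metadata plus the datasets it is derived from."""
    name: str
    owner: str
    description: str
    upstream: list = field(default_factory=list)  # names of source datasets

def trace_lineage(catalog, name):
    """Return every dataset that `name` depends on, nearest sources first."""
    lineage = []
    to_visit = list(catalog[name].upstream)
    while to_visit:
        current = to_visit.pop(0)
        if current in lineage:
            continue  # already recorded via another path
        lineage.append(current)
        to_visit.extend(catalog[current].upstream)
    return lineage

# Hypothetical catalog with three datasets.
catalog = {
    "raw_orders": DatasetEntry(
        "raw_orders", "data-engineering", "Orders as exported by the shop"
    ),
    "clean_orders": DatasetEntry(
        "clean_orders", "data-engineering", "Deduplicated orders", ["raw_orders"]
    ),
    "churn_features": DatasetEntry(
        "churn_features", "data-science", "Features for the churn model", ["clean_orders"]
    ),
}
```

With this structure, `trace_lineage(catalog, "churn_features")` walks back from the model's features to the raw export, which is the question a Data Scientist typically asks when a value looks suspicious.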
Create reusable feature modules
By creating reusable feature modules and standardizing processes, we ensure consistency across variables, reduce errors, and accelerate the transition from the testing phase to production.
These modules prevent data scientists from having to recreate the same transformations for each project, facilitate collaboration between teams, and ensure the maintainability of all projects that use them.
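A reusable feature module can be as simple as a small set of tested functions with a single entry point. The sketch below assumes a hypothetical customer record with `last_order`, `total_spent`, and `n_orders` fields; the feature names are illustrative.

```python
from datetime import date

def days_since(reference_day, past_day):
    """Number of days between two dates."""
    return (reference_day - past_day).days

def average_basket(total_spent, n_orders):
    """Average amount per order; 0.0 for customers without orders."""
    return total_spent / n_orders if n_orders else 0.0

def build_features(customer, today):
    """Single entry point used by both training and production code."""
    return {
        "days_since_last_order": days_since(today, customer["last_order"]),
        "average_basket": average_basket(customer["total_spent"], customer["n_orders"]),
    }
```

Because training notebooks and production pipelines both call `build_features`, a variable is computed the same way in both places, which removes one common source of discrepancies between test results and production behavior.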
Iterative & integrated experimentation workflow
Implementing an iterative and integrated experimentation workflow consists of creating test pipelines for each prototype model. These workflows make experiments more reliable and reproducible, and enable a gradual transition of models to production.
Respect data governance and compliance
The Data Engineer is primarily responsible for data governance, ensuring the creation of secure pipelines, access management, anonymization, and compliance with confidentiality rules.
Data scientists, for their part, participate in model governance. They ensure that their analyses and models are traceable, reproducible, and compliant with the same security and regulatory standards.
Together, they ensure that pipelines and models comply with security, compliance, and data quality requirements.
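As one example of a confidentiality measure inside a pipeline, the sketch below replaces direct identifiers with a keyed hash (HMAC-SHA256) from the Python standard library. Note that this is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping. The column names and the helper functions are hypothetical, and the key would have to be stored in a secrets manager, not in code.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def pseudonymize_rows(rows, sensitive_columns, secret_key):
    """Return copies of the rows with the sensitive columns pseudonymized."""
    protected = []
    for row in rows:
        copy = dict(row)  # leave the original row untouched
        for column in sensitive_columns:
            if copy.get(column) is not None:
                copy[column] = pseudonymize(str(copy[column]), secret_key)
        protected.append(copy)
    return protected
```

The same input always yields the same digest for a given key, so pseudonymized columns can still be used to join datasets or count distinct customers.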
Conclusion: the keys to successful data engineering and data science
As we have seen, the professions of Data Engineer and Data Scientist are complementary in carrying out a data/AI project from start to finish. Data engineers ensure the availability, quality, and security of data, while data scientists use this data to create reliable and actionable models.
Given the tensions that may exist—differing expectations, frustration with data quality—they must work closely together and adopt best practices to ensure that projects run smoothly, data quality is high, and models are reliable. This is based on a mutual understanding of each other's needs and effective coordination throughout the data and model lifecycle.
New professions have also emerged in data teams, such as ML Engineer and MLOps Engineer. These profiles play a key role in bridging the gap between data engineers and data scientists. They facilitate the deployment, orchestration, and monitoring of models in production. Their emergence demonstrates the importance of structured collaboration within data/AI projects.
FAQ - Data engineering and data science
The Data Engineer sets up pipelines that ensure reliable data flow processing. The Data Scientist uses this data to create machine learning models.
These two professions collaborate on a daily basis by sharing tools and documentation and regularly communicating their respective needs. They exchange information in order to adjust data formats and availability, with the aim of improving models and facilitating their industrialization.
It all depends on the maturity of the available data. If there are no reliable and robust data pipelines, it is preferable to recruit a Data Engineer first. On the other hand, if the data is already structured and accessible, and the objective is to extract value and insights from it, recruiting a Data Scientist is more relevant.
The Data Engineer has skills in ETL/ELT tools, databases, and data schema design.
Data scientists have skills in statistics, machine learning, and exploratory data analysis. They are proficient in Python and libraries such as Pandas, Scikit-learn, TensorFlow, and/or PyTorch.
In small organizations, it is possible to combine the roles of Data Engineer and Data Scientist within a versatile profile. This profile is responsible for data collection, preparation, and analysis, from setting up pipelines to developing models.
However, as data volumes and project complexity increase, it becomes necessary to separate these roles to ensure the reliability and performance of data/AI projects.