Data engineering
Study Course Implementer
Dzirciema street 16, Rīga, szf@rsu.lv
About Study Course
Objective
This course gives business and project managers an understanding of the fundamentals of data engineering and its importance in modern business. Participants will learn how data flows and data processing work, which will help them plan and manage data-driven projects more successfully and understand the requirements and challenges of building and maintaining data infrastructure.
Preliminary Knowledge
To participate successfully in this course, participants should have a basic understanding of computer science and IT infrastructure, as well as basic knowledge of databases and data analysis. An understanding of business processes and of how data is used to make decisions is also helpful. Project management knowledge is an advantage, as it helps in overseeing and coordinating data projects from a business perspective.
Learning Outcomes
Knowledge
1. Describe the role and responsibilities of the data engineer and analyse aspects of cooperation with IT specialists and business units.
2. Explain the structure of data flows and compare ETL and ELT processes by assessing their benefits and constraints in different contexts.
3. Analyse the structures of data storage systems and compare the suitability of SQL and NoSQL databases for different processing scenarios.
4. Explain the basic principles of batch and streaming data processing and assess their applicability to IoT data processing and telemetry analysis.
Presentation on the topic studied
5. Demonstrate understanding of the operation of distributed computing systems (Spark, Hadoop) and analyse their use in processing large amounts of data.
6. Compare the functionality of key cloud services (AWS, GCP, Azure) and evaluate their usability in different data engineering contexts.
Presentation on the topic studied
7. Describe data integration processes and identify best practices in data quality assurance to maintain accuracy and consistency.
8. Identify key tools and technologies in the data processing ecosystem and explain their role in different environments (local, cloud, etc.).
Presentation on the topic studied
9. Analyse data warehouse architecture, describe dimensional modeling, and explain the role of OLAP processes in data analysis.
10. Explain the architecture of data lakes and assess best practices in data storage and access in data lakes.
11. Demonstrate knowledge of real-time data processing technologies (Apache Kafka, Flink) and explain their suitability for telemetry data analysis.
12. Explain the planning, monitoring and implementation stages of data engineering projects and analyse the role of communication in their successful execution.
Skills
1. Work with data flows, data processing and integration tools (Apache Spark, Hadoop, Apache Kafka, Airflow, etc.) and databases (MySQL, PostgreSQL, MongoDB).
2. Work with cloud service platforms and use cloud infrastructure solutions to store, process, and analyse data.
3. Develop and implement data quality assurance plans, such as data validation and cleansing processes.
4. Optimise data flows by improving their performance and efficiency.
Competences
1. Ability to identify problems in data integration, storage and processing, and to offer effective solutions using appropriate technologies.
2. Ability to work effectively with other data engineers, analysts, developers, and project leaders to achieve common goals.
3. Competence to manage the data infrastructure by ensuring its efficient operation, compliance and security.
4. Ability to use up-to-date technologies and techniques, such as artificial intelligence and machine learning, to improve data processing.
Assessment
Individual work
| Title | % from total grade | Grade |
|---|---|---|
| 1. Presentation on the topic studied | - | Test |

Each student will be given a topic to study independently and present.
Examination
| Title | % from total grade | Grade |
|---|---|---|
| 1. Exam | - | 10 points |
Study Course Theme Plan
| Type | Modality | Location | Contact hours | Topics |
|---|---|---|---|---|
| Lecture | On site | Auditorium | 2 | Real-time data processing |
| Lecture | On site | Auditorium | 2 | Data pipelines |
| Lecture | On site | Auditorium | 2 | Design and architecture of data warehouses |
| Lecture | On site | Auditorium | 2 | Big data processing, distributed computing (Spark, Hadoop) |
| Lecture | On site | Auditorium | 2 | Data processing ecosystem |
| Lecture | On site | Auditorium | 2 | Data storage systems and databases |
| Lecture | On site | Auditorium | 2 | Data engineering project management |
| Lecture | On site | Auditorium | 2 | Batch vs. streaming data processing, telemetry and IoT data |
| Lecture | On site | Auditorium | 2 | Big data processing, distributed computing (Spark, Hadoop) |
| Lecture | On site | Auditorium | 2 | Data lake structures and best practices |
| Lecture | On site | Auditorium | 2 | Data storage systems and databases |
| Lecture | On site | Auditorium | 2 | Big data processing, distributed computing (Spark, Hadoop) |
| Lecture | On site | Auditorium | 2 | Batch vs. streaming data processing, telemetry and IoT data |
| Lecture | On site | Auditorium | 2 | Data lake structures and best practices |
| Lecture | On site | Auditorium | 2 | Design and architecture of data warehouses |
| Lecture | On site | Auditorium | 2 | Data engineering project management |
| Lecture | On site | Auditorium | 2 | Cloud computing (AWS, Google Cloud, Azure) |
| Lecture | On site | Auditorium | 2 | Batch vs. streaming data processing, telemetry and IoT data |
| Lecture | On site | Auditorium | 2 | Cloud computing (AWS, Google Cloud, Azure) |
| Lecture | On site | Auditorium | 2 | Data engineer role and responsibilities |
| Lecture | On site | Auditorium | 2 | Data pipelines |
| Lecture | On site | Auditorium | 2 | Data storage systems and databases |
| Lecture | On site | Auditorium | 2 | Data integration and data quality assurance |
| Lecture | On site | Auditorium | 2 | Data integration and data quality assurance |
Bibliography
Required Reading
Kleppmann M. 2017. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Suitable for English stream.
Akidau T., Chernyak S., Lax R. 2018. Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing. Suitable for English stream.
Dutt D.G. 2019. Cloud Native Data Center Networking. Suitable for English stream.
Akerkar R. 2014. Big Data: Principles and Paradigms (acceptable edition). Suitable for English stream.
Krishnan K. 2013. Data Warehousing in the Age of Big Data (acceptable edition). Suitable for English stream.
Additional Reading
Glass R., Callahan S. 2014. The Big Data-Driven Business: How to Use Big Data to Win Customers, Beat Competitors, and Boost Profits. Suitable for English stream.