Data Pre-processing
Study Course Implementer
Riga, Anniņmuižas boulevard 26a, 1st floor, office 147a and b, fizika@rsu.lv, +371 67061539
About Study Course
Objective
The objective of the data pre-processing study course is to give students the essential skills needed to prepare raw data for analysis. The main objectives include:
- Understanding data pre-processing: to understand the importance of data pre-processing and the basics of the data analysis workflow.
- Data cleaning: to learn methods for handling missing values, removing duplicates, and correcting errors to ensure data accuracy and consistency.
- Data transformation: to transform data into formats suitable for analysis, including normalisation, scaling and categorical variable encoding.
- Feature engineering: to create new features from existing data to improve model performance.
- Invalid data processing: to identify and manage invalid data so that it does not distort the analysis.
- Data integration and reduction: to combine data from different sources and reduce its size for efficient analysis.
- Practical experience: to gain practical experience with real-world datasets using industry-standard tools and software.
- Best practices and tools: to learn best practices and become familiar with tools and libraries such as Python’s Pandas, R and SQL.
- Preparation for further analysis: to ensure readiness for subsequent data analysis tasks such as machine learning and statistical analysis.
- Ethical considerations: to discuss ethical aspects, including data privacy and security during pre-processing.
At the end of the course, students will be able to confidently prepare raw data for different analytical applications, ensuring that the data are clean, well structured and ready to use.
Preliminary Knowledge
Knowledge of informatics at secondary school level.
Learning Outcomes
Knowledge
1. After completing the “Data Pre-Processing” study course, students will have in-depth knowledge of data pre-processing methods and techniques for various data formats and storage media, and will understand the importance of data quality and its impact on data analysis.
Skills
1. During the study course, students will develop practical skills in importing, cleaning, transforming, and extracting features from various data sources and formats. They will be able to process missing values, detect anomalies, and address data imbalances.
Competences
1. Having completed the study course, students will be competent to perform a full cycle of data pre-processing across different projects and to address real-world problems effectively. They will be able to adapt to different data types and processing challenges, develop automated solutions, and prepare data for further analysis and modeling. Students will be prepared to work in the fields of data science and analytics, applying the acquired knowledge and skills in a professional environment.
Assessment
Individual work
| Title | % from total grade | Grade |
|---|---|---|
| 1. Individual work | 30.00% | Test |

Students independently perform practical tasks and submit practical work reports in the e-learning environment.
Examination
| Title | % from total grade | Grade |
|---|---|---|
| 1. Examination | 70.00% | 10 points |

Students develop a project in which they perform data pre-processing on a dataset, present the project, and evaluate the results.
Study Course Theme Plan
- Lecture

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 2 |

Topics: Data Science Foundations (L1). PACE Strategy. Data preparation for analysis. Role of pre-processing in data analysis and machine learning processes. From raw data to finished data: key steps and methods. Data types and formatting.
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Basics of data pre-processing (P1). From raw data to prepared data. Getting data from different sources (CSV, Excel, SQL, API). Initial data exploration and analysis. Evaluation of features and initial problems. Introduction to using Google Colab.
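As an illustration of the initial data exploration named in P1, the sketch below loads CSV text and counts empty fields per column. It uses only the Python standard library; the sample data and the helper name `missing_counts` are hypothetical, and the course itself may use other tools such as Pandas in Google Colab.

```python
import csv
import io

# Hypothetical sample data; a real exercise would read a file instead.
SAMPLE_CSV = """id,age,city
1,34,Riga
2,,Liepaja
3,29,
"""

def missing_counts(csv_text):
    """Count empty or absent fields per column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name)
            # DictReader yields None for fields missing from short rows.
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts

print(missing_counts(SAMPLE_CSV))
```

A per-column count like this is usually the first signal of which features need attention before any cleaning decisions are made.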
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Data cleaning and preparation for analysis (P2). Technical data quality. Identification and filling of data gaps (imputation). Removal of duplicates and invalid values. Data quality assurance methods.
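The imputation topic in P2 can be sketched as follows. This is a minimal example assuming missing values are represented as `None` and that median imputation is acceptable for the variable; the function name `impute_median` and the sample values are hypothetical, and other strategies (mean, mode, model-based) may suit other data better.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

ages = [34, None, 29, 41, None, 38]  # hypothetical sample
print(impute_median(ages))
```

The median is used here because it is less sensitive to extreme values than the mean.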
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Missing data and strategies for handling them (P4). Data organization skills. Data transformation and manipulation. Data filtering, selection and grouping. Combining data from multiple sources.
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Duplicates and consistency (P5).
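One way to approach duplicates and consistency (P5) is to normalise the fields that identify a record before comparing them. The sketch below assumes records are dictionaries and that differences in case and surrounding whitespace should not count as differences; `drop_duplicates` and the sample records are hypothetical.

```python
def drop_duplicates(records, key_fields):
    """Keep the first record for each normalised key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        # Normalise case and whitespace so inconsistent spellings match.
        key = tuple(str(record[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

people = [
    {"name": "Anna Ozola", "city": "Riga"},
    {"name": " anna ozola", "city": "RIGA"},
    {"name": "Janis Berzins", "city": "Liepaja"},
]
print(drop_duplicates(people, ["name", "city"]))
```

Without the normalisation step, the first two records would be treated as distinct even though they describe the same entry.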
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Data type conversion and unit conversion (P6).
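A small sketch of the type and unit conversion in P6 follows. It assumes numbers arrive as text with a comma as the decimal separator and spaces as thousands separators; both function names are hypothetical, and input in another convention (for example `1,234.5`) would need different handling.

```python
def parse_number(text):
    """Convert text such as ' 1 234,5 ' to float; return None if float() rejects it."""
    # Assumption: comma is the decimal separator, spaces group thousands.
    cleaned = text.strip().replace(" ", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None

def pounds_to_kg(pounds):
    """Convert a mass from pounds to kilograms (1 lb = 0.45359237 kg)."""
    return pounds * 0.45359237

print(parse_number(" 1 234,5 "))
print(parse_number("n/a"))
print(pounds_to_kg(10))
```

Returning `None` for unparseable text keeps the failure visible, so it can be counted and handled as missing data instead of silently becoming zero.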
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Descriptive statistics as quality control (P3). The use of statistics for data preprocessing in quality control, including measures of central tendency, dispersion, and shape of distributions to assess data quality. Visual tools such as histograms, box plots, and control charts, as well as correlation and covariance to identify trends, relationships, and anomalies in data.
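To illustrate how descriptive statistics can act as quality control (P3), the sketch below summarises a numeric column with the standard library. The sample contains a hypothetical entry recorded in metres instead of centimetres; the helper name `summarise` is also an assumption.

```python
from statistics import mean, median, stdev

def summarise(values):
    """Return basic measures of central tendency and dispersion."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical heights in cm; 1.71 was probably entered in metres.
heights_cm = [162, 170, 168, 175, 1.71, 166]
summary = summarise(heights_cm)
print(summary)
```

Here the mean falls well below the median and the minimum is implausible for a height in centimetres, which together point to a data entry problem worth inspecting.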
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Filtering and selection (logical filters, subsets) (P7).
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Harmonization of categories and coding (P8).
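The harmonization and coding of categories in P8 might look like the sketch below: inconsistent labels are first mapped to a fixed vocabulary and then one-hot encoded. The alias table, the function names and the sample values are all hypothetical, and one-hot encoding is only one of several possible coding schemes.

```python
# Hypothetical alias table mapping raw spellings to canonical labels.
SEX_ALIASES = {
    "f": "female", "female": "female", "woman": "female",
    "m": "male", "male": "male", "man": "male",
}

def harmonise(value, aliases, unknown="unknown"):
    """Map a raw label to its canonical category, ignoring case and whitespace."""
    return aliases.get(value.strip().lower(), unknown)

def one_hot(values):
    """Encode each value as a dict of 0/1 indicators, one per category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

raw = ["F", "male ", "Female", "m", "?"]
clean = [harmonise(v, SEX_ALIASES) for v in raw]
encoded = one_hot(clean)
print(clean)
print(encoded[0])
```

Harmonising before encoding matters: encoding the raw labels directly would create separate indicator columns for `F` and `Female`.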
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Date/time fields and creating derived variables (P9).
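For the date/time topic in P9, a derived-variable step can be sketched as below. It assumes timestamps are stored as text in the `YYYY-MM-DD HH:MM` layout; the function name `derive_date_features` and the chosen features are hypothetical.

```python
from datetime import datetime

def derive_date_features(timestamp_text, fmt="%Y-%m-%d %H:%M"):
    """Parse a timestamp string and derive simple calendar features."""
    ts = datetime.strptime(timestamp_text, fmt)
    return {
        "year": ts.year,
        "month": ts.month,
        "hour": ts.hour,
        "weekday": ts.weekday(),          # Monday is 0, Sunday is 6
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }

features = derive_date_features("2024-03-09 14:30")
print(features)
```

Derived fields such as `is_weekend` often carry more signal for a model than the raw timestamp does.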
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Filtering outliers and erroneous values (P10).
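One common rule for the outlier filtering in P10 is the interquartile range (IQR) fence, sketched below. The syllabus does not say which method the course teaches, so this choice, the multiplier `k=1.5`, the function names and the sample readings are assumptions.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, k=1.5):
    """Separate values inside the fences from those outside them."""
    low, high = iqr_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if not (low <= v <= high)]
    return kept, flagged

# Hypothetical blood pressure readings with one implausible value.
systolic = [118, 122, 125, 130, 119, 127, 121, 300]
kept, flagged = split_outliers(systolic)
print(kept)
print(flagged)
```

The function returns flagged values instead of discarding them, since whether an outlier is an error or a genuine observation usually needs a domain judgment.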
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Practical pre-processing of data (P11). Applications in real life. Practical projects: preparation and analysis of data in different sectors (finance, medicine, transport). Cleaning and transformation of data in real projects. Preparation of data for obtaining final results.
- Class/Seminar

| Modality | Location | Contact hours |
|---|---|---|
| On site | Auditorium | 3 |

Topics: Project (P12). Pre-processing of a data set. Data cleaning, transformation and preparation for data analysis. Project presentation, evaluation of results and application of acquired skills.
Bibliography
Required Reading
Hands-On Data Preprocessing in Python. EBSCOhost Ebook Academic Collection, 2022. Suitable for English stream