Form No. M-3 (8)
Study Course Description

Data Pre-processing

Main Study Course Information

Course Code
FK_083
Branch of Science
Other medical sciences; Other Sub-Branches of Medical Sciences
ECTS
5.00
Target Audience
Business Management; Health Management; Management Science
LQF
Level 7
Study Type And Form
Full-Time

Study Course Implementer

Course Supervisor
Structural Unit Manager
Structural Unit
Department of Physics
Contacts

Riga, Anniņmuižas boulevard 26a, 1st floor, office 147a and b, fizika@rsu.lv, +371 67061539

About Study Course

Objective

The objective of the data pre-processing study course is to give students the essential skills needed to prepare raw data for analysis. The main objectives are:

Understanding data pre-processing: to understand the importance of data pre-processing and the basics of the data analysis workflow.
Data cleaning: to learn methods for handling missing values, removing duplicates, and correcting errors to ensure data accuracy and consistency.
Data transformation: to transform data into formats suitable for analysis, including normalisation, scaling and categorical variable encoding.
Feature engineering: to create new features from existing data to improve model performance.
Invalid data processing: to identify and manage invalid data so that they do not distort the analysis.
Data integration and reduction: to combine data from different sources and reduce their size for efficient analysis.
Practical experience: to gain practical experience with real-world datasets using industry-standard tools and software.
Best practices and tools: to learn best practices and become familiar with tools and libraries such as Python’s Pandas, R and SQL.
Preparation for further analysis: to be ready to perform subsequent data analysis tasks such as machine learning and statistical analysis.
Ethical considerations: to discuss ethical aspects, including data privacy and security during pre-processing.

At the end of the course, students will be able to confidently prepare raw data for different analytical applications, ensuring that the data are clean, well-structured and ready to use.

Preliminary Knowledge

Knowledge of informatics at secondary school level.

Learning Outcomes

Knowledge

1. After completing the “Data Pre-Processing” study course, students will have in-depth knowledge of data pre-processing methods and techniques for various data formats and storage media, and will understand the importance of data quality and its impact on data analysis.

Skills

1. During the study course, students will develop practical skills in importing, cleaning, transforming, and extracting features from various data sources and formats. They will be able to process missing values, detect anomalies, and address data imbalances.

Competences

1. Having completed the study course, students will be able to perform the full cycle of data pre-processing across different projects and address real-world problems effectively, adapt to different data types and processing challenges, develop automated solutions, and prepare data for further analysis and modeling. Students will be prepared to work in the fields of data science and analytics, applying the acquired knowledge and skills in a professional environment.

Assessment

Individual work

1. Individual work
% from total grade: 30.00%
Grade: Test

Students independently perform practical tasks and submit practical work reports in the e-learning environment.

Examination

1. Examination
% from total grade: 70.00%
Grade: 10 points

Students develop a project in which they perform data preprocessing on a dataset, present the project, and evaluate the results.

Study Course Theme Plan

FULL-TIME
Part 1
  1. Lecture

Modality: On site
Location: Auditorium
Contact hours: 2

Topics

Data Science Foundations (L1). PACE Strategy. Data preparation for analysis. Role of pre-processing in data analysis and machine learning processes. From raw data to finished data: key steps and methods. Data types and formatting.
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Basics of data preprocessing (P1). From raw data to prepared data. Getting data from different sources (CSV, Excel, SQL, API). Initial data exploration and analysis. Evaluation of features and initial problems. Introduction to using Google Colab.
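
As an illustration of the import and initial inspection steps named in this topic, a minimal sketch with pandas might look as follows. The CSV content, the column names, and the in-memory SQLite table are invented for the example and are not taken from the course materials.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
csv_text = "patient_id,age,city\n1,34,Riga\n2,,Liepaja\n3,51,Riga\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical SQL source: an in-memory SQLite database with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, visits INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [(1, 3), (2, 1), (3, 5)])
df_sql = pd.read_sql_query("SELECT * FROM visits", conn)
conn.close()

# Initial exploration: table shape, column types, missing values per column.
shape = df_csv.shape
dtypes = df_csv.dtypes
missing_per_column = df_csv.isna().sum()
```

Checking `shape`, `dtypes` and `missing_per_column` before any cleaning gives a first picture of the problems the later steps need to address.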
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Data cleaning and preparation for analysis (P2). Technical data quality. Identification and filling of data gaps (imputation). Removal of duplicates and invalid values. Data quality assurance methods.
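
The two cleaning steps named here, removing duplicates and filling gaps, can be sketched with pandas as follows. The dataset is hypothetical, and median imputation is only one of several possible strategies, chosen here for brevity.

```python
import pandas as pd

# Hypothetical dataset with one duplicated row and one missing value.
df = pd.DataFrame(
    {
        "patient_id": [1, 2, 2, 3, 4],
        "age": [34.0, 40.0, 40.0, None, 58.0],
    }
)

# Remove exact duplicate rows, keeping the first occurrence.
deduplicated = df.drop_duplicates().reset_index(drop=True)

# Fill the remaining gap with the column median (an assumed strategy).
median_age = deduplicated["age"].median()
cleaned = deduplicated.assign(age=deduplicated["age"].fillna(median_age))
```

Duplicates are removed before imputation so that repeated rows do not bias the median used to fill the gap.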
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Missing data and strategies for handling them (P4). Data organization skills. Data transformation and manipulation. Data filtering, selection and grouping. Combining data from multiple sources.
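
Combining data from multiple sources and grouping the result could look like the sketch below. The two tables, the key `dept_id`, and the choice of a left join are assumptions made for the example.

```python
import pandas as pd

# Hypothetical tables from two sources sharing the key "dept_id".
staff = pd.DataFrame(
    {"dept_id": [1, 1, 2, 3], "salary": [1000, 1400, 1200, 900]}
)
departments = pd.DataFrame(
    {"dept_id": [1, 2], "dept_name": ["Finance", "Transport"]}
)

# Combine the sources; a left join keeps staff rows without a matching department.
combined = staff.merge(departments, on="dept_id", how="left")

# Group by department name and compute the mean salary per group.
# Rows whose dept_name is missing are dropped by groupby by default.
mean_salary = combined.groupby("dept_name")["salary"].mean()
```

The unmatched row (`dept_id` 3) survives the merge with a missing `dept_name`, which makes the gap between the two sources visible instead of silently discarding it.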
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Duplicates and consistency (P5).
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Data type conversion and unit conversion (P6).
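
A small sketch of both kinds of conversion, assuming hypothetical columns `height_cm` (numbers stored as text) and `weight_lb`:

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, with one unparseable entry.
df = pd.DataFrame(
    {"height_cm": ["180", "165", "n/a"], "weight_lb": [176.0, 132.0, 154.0]}
)

# Type conversion: text to numeric; unparseable entries become NaN.
df["height_cm"] = pd.to_numeric(df["height_cm"], errors="coerce")

# Unit conversion: centimetres to metres, pounds to kilograms.
LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound
df["height_m"] = df["height_cm"] / 100
df["weight_kg"] = df["weight_lb"] * LB_TO_KG
```

With `errors="coerce"` the unparseable entry becomes an explicit missing value, which can then be handled by the usual missing-data strategies.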
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Descriptive statistics as quality control (P3). The use of statistics for data preprocessing in quality control, including measures of central tendency, dispersion, and shape of distributions to assess data quality. Visual tools such as histograms, box plots, and control charts, as well as correlation and covariance to identify trends, relationships, and anomalies in data.
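
As a sketch of how measures of central tendency, dispersion, and shape can flag a quality problem, consider the hypothetical series below, in which one entry is implausibly large.

```python
import pandas as pd

# Hypothetical measurements; the last value is an implausible entry.
values = pd.Series([120, 118, 125, 122, 119, 250], name="value")

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "std": values.std(),        # sample standard deviation (ddof=1)
    "skewness": values.skew(),  # shape of the distribution
}

# A mean far above the median hints at a right tail or an erroneous entry.
mean_median_gap = summary["mean"] - summary["median"]
```

Here the median is 121 while the mean is pulled above 142, so the gap alone suggests inspecting the largest values before further analysis.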
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Filtering and selection (logical filters, subsets) (P7).
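
Logical filters and subsets can be illustrated with a boolean mask in pandas. The table and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sector": ["finance", "medicine", "transport", "medicine"],
        "amount": [250, 90, 40, 310],
        "year": [2022, 2023, 2023, 2021],
    }
)

# Logical filter: combine conditions with & (and) or | (or);
# each condition needs its own parentheses.
mask = (df["sector"] == "medicine") & (df["amount"] > 100)

# Subset: keep the matching rows and only two of the columns.
subset = df.loc[mask, ["sector", "amount"]]

# The same row filter written with query().
same_rows = df.query("sector == 'medicine' and amount > 100")
```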
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Harmonization of categories and coding (P8).
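
A possible sketch of harmonising inconsistent category labels and then coding them numerically. The raw labels and the mapping table are invented for the example.

```python
import pandas as pd

# Hypothetical raw labels with inconsistent spelling, case and whitespace.
raw = pd.Series(["Male", "male ", "F", "female", "M"], name="sex")

# Harmonise: trim whitespace, lower-case, then map variants to one label.
mapping = {"male": "male", "m": "male", "female": "female", "f": "female"}
harmonised = raw.str.strip().str.lower().map(mapping)

# Coding: integer codes via a categorical type, and one-hot columns.
codes = harmonised.astype("category").cat.codes
one_hot = pd.get_dummies(harmonised, prefix="sex")
```

Labels missing from `mapping` would become NaN after `map`, which makes unexpected categories easy to detect.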
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Date/time fields and creating derived variables (P9).
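
Parsing date fields and deriving new variables from them might be sketched as follows. The columns `admitted` and `discharged` and the date format are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records with dates stored as text.
df = pd.DataFrame(
    {
        "admitted": ["2024-01-29", "2024-02-10"],
        "discharged": ["2024-02-02", "2024-02-11"],
    }
)

# Parse text into datetime values with an explicit format.
for col in ["admitted", "discharged"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

# Derived variables: length of stay in days, admission month and weekday.
df["stay_days"] = (df["discharged"] - df["admitted"]).dt.days
df["admit_month"] = df["admitted"].dt.month
df["admit_weekday"] = df["admitted"].dt.day_name()
```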
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Filtering outliers and erroneous values (P10).
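
One common way to filter outliers is Tukey's interquartile-range rule. The sketch below applies it to hypothetical values; the rule and the 1.5 multiplier are an assumed choice, not necessarily the method used in the course.

```python
import pandas as pd

# Hypothetical values containing two implausible entries (95 and -40).
values = pd.Series([10, 12, 11, 13, 12, 95, 11, -40])

# Tukey's rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
filtered = values[~is_outlier]
```

Flagged values are worth inspecting before removal, since an extreme value may be a genuine observation and not an error.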
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Practical pre-processing of data (P11). Applications in real life. Practical projects: preparation and analysis of data in different sectors (finance, medicine, transport). Cleaning and transformation of data in real projects. Preparation of data for obtaining final results.
  1. Class/Seminar

Modality: On site
Location: Auditorium
Contact hours: 3

Topics

Project (P12). Preprocessing of a dataset. Data cleaning, transformation and preparation for data analysis. Project presentation, evaluation of results and application of acquired skills.
Total ECTS (Credit points):
5.00
Contact hours:
38 Academic Hours
Final Examination:
Exam

Bibliography

Required Reading

1. Hands-On Data Preprocessing in Python. EBSCOhost Ebook Academic Collection, 2022. Suitable for English stream

2. Data Wrangling with Python. Suitable for English stream

Additional Reading

1. Foundational Python for Data Science. Suitable for English stream

2. Python for Data Science. Suitable for English stream

Other Information Sources

1. Preprocessing - Categorical Data. Suitable for English stream

2. PacktPublishing/Hands-On-Data-Preprocessing-in-Python. Suitable for English stream

3. How to Preprocess Data in Python. Suitable for English stream