Modélisation statistique pour les données complexes et le Big Data
3rd-year Bachelor's degree course, IUT Nice Côte d’Azur, BUT Science des Données, S6, 2024
Project on statistical modeling for complex data and big data, for 3rd-year BUT students.
Course description
Objective: statistical modeling of a given dataset, following all the steps required to understand its variables, process them, and model them for a given task.
Method:
- Work in groups of 2 or 3 students (chosen at random), using Python.
- Objectives will be set progressively; every 2 sessions, each group will report on the progress of its work (in the form of a notebook).
Evaluation:
- Continuous assessment (1/5): Work progress.
- Written report (2/5): a polished written report in which all the results obtained are brought together, commented on, and put into perspective. Due on 30/05.
- Oral exam (2/5): 10/06 from 8 am. Each group will present its work as a talk (with slides or other visual support). 15-minute presentation + 15-minute Q&A.
1st project: binary classification
Data (training and test sets)
- G1 Hadjadj-Tognaccioli Data G1 (.zip)
- G2 Billon-Minot Data G2 (.zip)
- G3 Camara-Trochon Data G3 (.zip)
- G4 Godin-Langlois Data G4 (.zip)
- G5 Frandon Data G5 (.zip)
1st Step: data analysis and preparation
Dataset size and type. Are there any missing data? If so, we will eliminate them (for now).
How many classes are there, and how many elements per class?
Selection of the most significant variables: explore and suggest several methods. Are there any collinearities? Which variables will you retain? How are the selected variables distributed?
Pay particular attention to the graphical visualization of your results.
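A minimal sketch of this first step, assuming the data come as a CSV file (the file name train.csv and the label column "label" are placeholders for your group's own files):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the training data (file name and label column are placeholders).
df = pd.read_csv("train.csv")

# Dataset size and variable types.
print(df.shape)
print(df.dtypes)

# Missing data: count them, then drop the affected rows (for now).
print(df.isna().sum())
df = df.dropna()

# Number of classes and elements per class.
print(df["label"].value_counts())

# Collinearity: correlation matrix of the explanatory variables,
# visualized as a heatmap to spot highly correlated pairs.
corr = df.drop(columns="label").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()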
2nd Step: choice of classification model, hyperparameter tuning
Prepare the training dataset for cross-validation (choice of the number of folds, split).
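For instance, with scikit-learn (5 folds and the fixed seed are illustrative choices to justify):

```python
from sklearn.model_selection import StratifiedKFold

# Stratified folds preserve the class proportions in each split,
# which matters for binary classification with unequal classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
```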
- Several models can be suited to the task at hand: logistic regression, SVM, random forests, XGBoost, …
- Initialize the models
- Test them separately (k-fold cross-validation)
- Tune their hyperparameters (see the sketch after this list)
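A sketch of these three points, reusing the cv object defined above and assuming X_train and y_train are the features and labels prepared in the 1st step (the candidate models and the parameter grid are illustrative, not prescribed):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Initialize a few candidate models (illustrative choices and settings).
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

# Test each model separately with k-fold cross-validation.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Tune hyperparameters, here for the SVM (the grid is an arbitrary example).
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=cv,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```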
Compare the performance of the different models tested (each with its final choice of hyperparameters).
Which model/parameters did you choose?
Could a voting strategy improve predictions? Test it.
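One way to test it is scikit-learn's VotingClassifier, sketched here under the same naming assumptions as above (the tuned hyperparameter values are placeholders for your own results):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Combine the tuned models, either by majority vote ("hard")
# or by averaging predicted probabilities ("soft").
voting = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC(C=10, kernel="rbf", probability=True)),  # placeholder tuning
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",  # "soft" requires probability estimates from every model
)
scores = cross_val_score(voting, X_train, y_train, cv=cv)
print(f"voting: {scores.mean():.3f} +/- {scores.std():.3f}")
```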
With the final model chosen, predict the labels of the test dataset.
- Pay particular attention to the graphical visualization of your results.
2nd project: multiclass classification
Data
Download the brain dataset (5 classes - Brain_GSE50161.csv) from CuMiDa
The original paper describing the CuMiDa project is available here
Steps
Preprocess your data using MinMax scaling.
Split the dataset into training and test sets, accounting for stratification.
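A sketch covering these two steps; the column names "samples" and "type" are assumptions about the CSV layout to check against the actual file, and the scaler is deliberately fitted on the training set only, to avoid leaking test-set information:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load the CuMiDa brain dataset (column names are assumptions).
df = pd.read_csv("Brain_GSE50161.csv")
X = df.drop(columns=["samples", "type"])
y = df["type"]

# Stratified split keeps the class proportions identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit MinMax scaling on the training set only, then apply it to both sets.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```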
The dataset appears to be imbalanced: a common strategy to cope with this is to balance the classes by oversampling the minority classes during training. For instance, you can use SMOTE (Synthetic Minority Over-sampling Technique). An implementation of widely used oversampling methods is available in imbalanced-learn.
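A minimal sketch with imbalanced-learn, reusing the split from the previous step; note that only the training set is resampled, the test set must keep its original distribution:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Oversample the minority classes of the training set only.
smote = SMOTE(random_state=0)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_train_res))
```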
Perform feature selection/dimensionality reduction: how many features/dimensions should be kept?
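Two common options, sketched with scikit-learn; the 95% variance threshold and k=500 are arbitrary starting points to justify or adjust, and the resampled training arrays come from the previous step:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Option 1: keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_train_red = pca.fit_transform(X_train_res)
X_test_red = pca.transform(X_test)
print(pca.n_components_)  # number of dimensions actually kept

# Option 2: univariate selection of the k best features (k to be tuned).
selector = SelectKBest(f_classif, k=500)
X_train_sel = selector.fit_transform(X_train_res, y_train_res)
```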
Repeat the steps followed for the first project to propose an optimized classifier (additional classification strategies can be tested).
Use cross-validation to optimize the hyperparameters and draw conclusions about the stability of the results.
Perform a final evaluation on the test dataset.
Pay particular attention to the graphical visualization of your results.