Modélisation statistique pour les données complexes et le Big Data
3rd-year Bachelor's degree course, IUT Nice Côte d’Azur, BUT Science des Données, S6, 2024
Project on statistical modeling for complex data and big data, for 3rd-year BUT students.
Course description
Objective: statistical modeling of a given dataset, following all the steps required to understand its variables, process them, and model them for a given task.
Method:
- Work in groups of 2 or 3 students (chosen at random), using Python.
- Objectives will be set progressively; every 2 sessions, each group will report on the progress of its work (in the form of a notebook).
Evaluation:
- Continuous assessment (1/5): Work progress.
- Written report (2/5): a polished written report in which all the results obtained are brought together, commented on, and put into perspective. Due on 30/05.
- Oral exam (2/5): 10/06 from 8 am. Each group will present its work as a talk (with slides or other visual support). 15-minute presentation + 15-minute Q&A.
1st project: binary classification
Data (training and test sets)
- G1 Hadjadj-Tognaccioli Data G1 (.zip)
- G2 Billon-Minot Data G2 (.zip)
- G3 Camara-Trochon Data G3 (.zip)
- G4 Godin-Langlois Data G4 (.zip)
- G5 Frandon Data G5 (.zip)
1st Step: data analysis and preparation
Dataset size and type. Are there any missing data? If so, we will eliminate them (for now).
How many classes are there, and how many elements per class?
Selection of the most significant variables: explore and suggest several methods. Are there any collinearities? Which variables will you retain? How are the selected variables distributed?
Pay particular attention to the graphical visualization of your results.
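A minimal sketch of this first step, assuming the data come as a CSV file (the file name train.csv and the label column "label" are placeholders for your group's own files):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the training data (file name and label column are placeholders).
df = pd.read_csv("train.csv")

# Dataset size and variable types.
print(df.shape)
print(df.dtypes)

# Missing data: count them, then drop the affected rows (for now).
print(df.isna().sum())
df = df.dropna()

# Number of classes and elements per class.
print(df["label"].value_counts())

# Collinearity: correlation matrix of the explanatory variables,
# visualized as a heatmap to spot highly correlated pairs.
corr = df.drop(columns="label").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()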
2nd Step: choice of classification model, hyperparameter tuning
Prepare the training dataset for cross-validation (choice of the number of folds, split).
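For instance, with scikit-learn (5 folds and the fixed seed are illustrative choices to justify):

```python
from sklearn.model_selection import StratifiedKFold

# Stratified folds preserve the class proportions in each split,
# which matters for binary classification with unequal classes.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
```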
- Several models can be suited to the task at hand: logistic regression, SVM, random forests, XGBoost, …
- Initialize the models
- Test them separately (k-fold cross-validation)
- Tune their hyperparameters (see the sketch after this list)
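A sketch of these three points, reusing the cv object defined above and assuming X_train and y_train are the features and labels prepared in the 1st step (the candidate models and the parameter grid are illustrative, not prescribed):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Initialize a few candidate models (illustrative choices and settings).
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

# Test each model separately with k-fold cross-validation.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Tune hyperparameters, here for the SVM (the grid is an arbitrary example).
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=cv,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```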
Compare the performance of the different models tested (each with its final choice of hyperparameters).
Which model/parameters did you choose?
Could a voting strategy improve predictions? Test it.
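One way to test it is scikit-learn's VotingClassifier, sketched here under the same naming assumptions as above (the tuned hyperparameter values are placeholders for your own results):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Combine the tuned models, either by majority vote ("hard")
# or by averaging predicted probabilities ("soft").
voting = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC(C=10, kernel="rbf", probability=True)),  # placeholder tuning
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",  # "soft" requires probability estimates from every model
)
scores = cross_val_score(voting, X_train, y_train, cv=cv)
print(f"voting: {scores.mean():.3f} +/- {scores.std():.3f}")
```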
With the final model chosen, predict the labels of the test dataset.
- Pay particular attention to the graphical visualization of your results.
2nd project: multiclass classification
Data
Download the brain dataset (5 classes - Brain_GSE50161.csv) from CuMiDa
The original paper describing the CuMiDa project is available here
Steps
Preprocess your data using MinMax scaling.
Split the dataset into training and test sets, accounting for stratification.
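A sketch covering these two steps; the column names "samples" and "type" are assumptions about the CSV layout to check against the actual file, and the scaler is deliberately fitted on the training set only, to avoid leaking test-set information:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load the CuMiDa brain dataset (column names are assumptions).
df = pd.read_csv("Brain_GSE50161.csv")
X = df.drop(columns=["samples", "type"])
y = df["type"]

# Stratified split keeps the class proportions identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit MinMax scaling on the training set only, then apply it to both sets.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```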
The dataset appears to be imbalanced: a common strategy to cope with this is to balance the classes by oversampling the minority classes during training. For instance, you can use SMOTE (Synthetic Minority Over-sampling Technique). An implementation of widely used oversampling methods is available in imbalanced-learn.
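A minimal sketch with imbalanced-learn, reusing the split from the previous step; note that only the training set is resampled, the test set must keep its original distribution:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Oversample the minority classes of the training set only.
smote = SMOTE(random_state=0)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_train_res))
```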
Perform feature selection/dimensionality reduction: how many features/dimensions should be kept?
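Two common options, sketched with scikit-learn; the 95% variance threshold and k=500 are arbitrary starting points to justify or adjust, and the resampled training arrays come from the previous step:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Option 1: keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_train_red = pca.fit_transform(X_train_res)
X_test_red = pca.transform(X_test)
print(pca.n_components_)  # number of dimensions actually kept

# Option 2: univariate selection of the k best features (k to be tuned).
selector = SelectKBest(f_classif, k=500)
X_train_sel = selector.fit_transform(X_train_res, y_train_res)
```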
Repeat the steps followed for the first project to propose an optimized classifier (additional classification strategies can be tested).
Use cross-validation to optimize the hyperparameters and draw conclusions about the stability of the results.
Perform a final evaluation on the test dataset.
Pay particular attention to the graphical visualization of your results.