Statistical modeling for complex data and Big Data

3rd-year Bachelor's degree course, IUT Nice Côte d’Azur, BUT Science des Données, S6, 2024

Project on statistical modeling for complex data and big data, for 3rd year BUT students.

Course description

Objective: Statistical modeling based on a given dataset, following all the steps required to understand its variables, process them and model them for a given objective.

Method:

  • Independent work in groups of 2 or 3 students (assigned at random), using Python.
  • Objectives will be set progressively. Every 2 sessions, each group must submit its progress in the form of a notebook.

Evaluation:

  • Continuous assessment (1/5): Work progress.
  • Written report (2/5): A polished written report that brings together all the results obtained, comments on them and puts them into perspective. Due on 13/06 (at the end of the session).
  • Oral exam (2/5): 17/06 from 8am. Each group will present its work in the form of a talk (with visual aids). 15-minute presentation + 15-minute Q&A.

Groups (randomly assigned)

  • G1 CROMBET, NIASSY
  • G2 FALAIS, EZ-ZEROUALI
  • G3 VIAUD, CARTON
  • G4 VIGLIETTI, HIRSINGER
  • G5 GOURAR, DIALLO
  • G6 JOUNDI, LOFTI, NOUIRA

Data (training and test sets)

Data G1 (.zip)
Data G2 (.zip)
Data G3 (.zip)
Data G4 (.zip)
Data G5 (.zip)
Data G6 (.zip)

1st Step: data analysis and preparation

  • Dataset size and type. Are there any missing values? If so, we will drop them (for now).

  • How many classes are there, and how many samples per class?

  • Selection of the most significant variables: explore and suggest several methods. Are there any collinearities? Which variables will you retain? How are the selected variables distributed?

  • Pay particular attention to the graphical visualization of your results.
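
The checks in this first step could be sketched as follows. This is only a minimal illustration under stated assumptions: the data is synthetic (generated with scikit-learn), the column names `x0`…`x7` and `label` are hypothetical, and the 0.8 correlation threshold and the mutual-information ranking are just one possible choice among the several selection methods each group is expected to explore.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for a group's training set.
# The real files, column names and label column are NOT known here (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(8)])
df["label"] = y
df.iloc[::50, 0] = np.nan  # inject a few missing values for illustration

# Dataset size, variable types and missing values per column.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Drop incomplete rows (the "for now" strategy of this step).
df_clean = df.dropna()

# Number of classes and samples per class.
class_counts = df_clean["label"].value_counts()
print(class_counts)

# Collinearity: absolute pairwise correlations, keeping each pair once.
features = df_clean.drop(columns="label")
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
collinear_pairs = pairs[pairs > 0.8]  # 0.8 is an arbitrary illustrative threshold
print(collinear_pairs)

# One possible relevance ranking: mutual information with the label.
mi = pd.Series(
    mutual_info_classif(features, df_clean["label"], random_state=0),
    index=features.columns,
).sort_values(ascending=False)
print(mi)
```

The `corr` matrix and the `mi` ranking are natural inputs for plots (for example a heatmap and a bar chart), and the distribution of each retained variable can be inspected with per-class histograms.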

2nd Step: choice of classification model, hyperparameter tuning

  • Preparation of the training dataset for cross-validation (choice of the number of folds, split).

  • Several models can be adapted to the task at hand: logistic regression, SVM, random forests, XGBoost, …
    • Initialize the models
    • Test them separately (k-fold cross-validation)
    • Tune their hyperparameters
  • Compare the performance of the different models tested (each with its final choice of hyperparameters).

  • Which model/parameters did you choose?

  • Could a voting strategy improve predictions?

  • With the final model chosen, predict the labels of the test dataset (to be saved to a .csv file and sent with the notebook).

  • Pay particular attention to the graphical visualization of your results.
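
One way the second step might look in scikit-learn is sketched below. It is an illustration under assumptions, not a prescribed solution: the data is synthetic, the hyperparameter grids are deliberately tiny, accuracy is used as the score, hard (majority) voting is chosen for simplicity, and the output file name `predictions.csv` and its `prediction` column are hypothetical. XGBoost is left out only to keep the sketch dependent on scikit-learn alone.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a training set and an unlabeled test set (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, _ = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Stratified k-fold keeps the class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Candidate models, each with a small illustrative hyperparameter grid.
candidates = {
    "logreg": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
               {"logisticregression__C": [0.1, 1.0, 10.0]}),
    "svm": (make_pipeline(StandardScaler(), SVC()),
            {"svc__C": [0.1, 1.0, 10.0]}),
    "forest": (RandomForestClassifier(random_state=0),
               {"n_estimators": [50, 100], "max_depth": [None, 5]}),
}

# Tune each model separately with cross-validated grid search.
tuned = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    tuned[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 3))

# Majority vote over the tuned models, scored with the same folds.
voter = VotingClassifier(estimators=list(tuned.items()), voting="hard")
vote_scores = cross_val_score(voter, X_train, y_train, cv=cv, scoring="accuracy")
print("voting", round(vote_scores.mean(), 3))

# Fit the chosen model on the full training set and export test predictions.
voter.fit(X_train, y_train)
predictions = pd.DataFrame({"prediction": voter.predict(X_test)})
out_path = Path(tempfile.gettempdir()) / "predictions.csv"  # hypothetical location
predictions.to_csv(out_path, index=False)
```

Comparing the printed cross-validation scores (ideally with their spread across folds, shown as a box plot) indicates whether voting actually helps on a given dataset; it is not guaranteed to beat the best single model.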