Machine Learning Mastery courses (complete Python files) combined with Kaggle courses (Jupyter notebooks).
Kaggle is the most popular platform for data science. It offers free datasets, practice projects, and competitions, plus a helpful community where you can share your thoughts and learn new things. But its best feature is Kaggle Learn: even if you don’t know anything about data science, you can learn all the basics from the Kaggle courses and then sharpen your skills by doing projects.
In this repository you will find the Kaggle Learn tutorial and exercise notebooks (.ipynb), which I have completed and earned completion certificates for. The Kaggle datasets are available in the inputKaggle folder and the Mastery datasets in the inputMastery folder. The course structure is as follows:
pip install -r requirements.txt
P01. Python Basics
Functions, Lists, Strings and Dictionaries
P02. Guessing
Guess the number - the user has to guess a number picked by the computer
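A minimal sketch of the game loop; the bounds and prompts are illustrative, not the repo's exact code:

```python
# Minimal sketch of the guessing loop; bounds and prompts are illustrative
import random

secret = random.randint(1, 100)   # the computer picks a number
while True:
    guess = int(input("Your guess: "))
    if guess < secret:
        print("Too low")
    elif guess > secret:
        print("Too high")
    else:
        print("Correct!")
        break
```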
P03. Age
The user enters their age (decimals allowed) and gets it back in seconds
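A sketch of the conversion, assuming an average year of 365.25 days:

```python
# Sketch of the conversion; assumes an average year of 365.25 days
age_years = float(input("Your age: "))           # decimals are allowed
age_seconds = age_years * 365.25 * 24 * 60 * 60
print(f"That is about {age_seconds:,.0f} seconds")
```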
P04. PriceOfAChair
Downloads a page and uses BeautifulSoup to extract individual pieces of data from it
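A sketch of the scraping pattern; the URL and the selector below are placeholders, not the actual page used in the project:

```python
# Sketch of the scraping pattern; URL and selector are placeholders
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com/product")  # download the page
soup = BeautifulSoup(page.content, "html.parser")   # parse the HTML
price = soup.find("span", class_="price")           # locate one element
print(price.get_text(strip=True) if price else "price not found")
```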
P05. RandomNumbers
Uses numpy's pseudorandom number generator to generate random numbers in the ranges 1...45 and 1...20
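A sketch with numpy's Generator API; the seed and draw counts are illustrative:

```python
# Sketch with numpy's Generator API; seed and draw counts are illustrative
import numpy as np

rng = np.random.default_rng(seed=42)
main = rng.integers(1, 46, size=6)    # six draws in 1...45 (upper bound is exclusive)
extra = rng.integers(1, 21, size=1)   # one draw in 1...20
print(main, extra)
```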
P06. Dictionary
Interactive dictionary - uses data.json and displays information about the words entered (see the sketch below).
It has similarities with how Large Language Models (LLMs) work: data lookup, fuzzy matching, and user interaction.
LLMs are of course more powerful: they use neural networks to represent text (not a JSON file), can find patterns and reason over the input (not just retrieve data), are scalable, and can generalize beyond their knowledge base.
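A minimal sketch of the lookup-plus-fuzzy-matching pattern, with difflib standing in for the project's actual matcher:

```python
# Sketch of lookup plus fuzzy matching; difflib stands in for the project's matcher
import json
from difflib import get_close_matches

with open("data.json") as f:
    data = json.load(f)

word = input("Enter a word: ").lower()
if word in data:
    print(data[word])                 # exact lookup
else:
    close = get_close_matches(word, data.keys(), n=1, cutoff=0.8)
    if close:
        print(f"Did you mean {close[0]}?", data[close[0]])   # fuzzy match
    else:
        print("Word not found")
```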
K02. Intro To Machine Learning
Starts with DecisionTree models, then moves to RandomForest, which has the best performance
K03. Pandas
Uses pandas to read Wine data, describe it, fillna and work with columns
K04_z0. Intermediate Machine Learning
Trains the 4 RandomForest models from point 2, finds the best one, then generates a submission
K04_z1. Housing Prices Competition
Compares a DecisionTreeRegressor with a RandomForest model (RandomForest wins), then generates a submission
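A sketch of the comparison on synthetic data; the notebooks run the same pattern on the Kaggle housing data:

```python
# Sketch of the comparison on synthetic data; the notebooks use the Kaggle housing set
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

for model in (DecisionTreeRegressor(random_state=0),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_valid, model.predict(X_valid))
    print(type(model).__name__, "MAE:", round(mae, 1))
```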
K04_z2. Pipelines
Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.
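A minimal scikit-learn sketch of the pattern on toy data (the notebooks apply it to the housing data):

```python
# Minimal sketch on toy data; the notebooks apply the same pattern to the housing data
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"rooms": [3, 4, None, 5], "city": ["A", "B", "A", "B"]})
y = [200, 250, 180, 300]

preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)           # one call runs preprocessing and training
print(pipeline.predict(X))   # the same preprocessing is reapplied before predicting
```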
K04_z3. XGBoost
Gradient boosting is a method that goes through cycles to iteratively add models into an ensemble. We use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model parameters so that adding this new model to the ensemble will reduce the loss.
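A sketch with the XGBoost scikit-learn wrapper on synthetic data, assuming a recent xgboost (>= 1.6) where early_stopping_rounds is a constructor argument:

```python
# Sketch with the XGBoost sklearn wrapper; assumes xgboost >= 1.6,
# where early_stopping_rounds is a constructor argument
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# each boosting round fits a new tree chosen to reduce the loss of the current ensemble
model = XGBRegressor(n_estimators=500, learning_rate=0.05, early_stopping_rounds=10)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("MAE:", mean_absolute_error(y_valid, model.predict(X_valid)))
```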
K05_z0. Data Visualization
Seaborn, Line Charts, Custom Styles, Heat Maps
K05_z1. Breast Cancer Detection
Histograms for benign and malignant tumors, KDE plots
K06. Feature Engineering
Features, Clustering with K-Means, Principal Component Analysis
K07. Data Cleaning
Minmax_scaling, Normalization, Remove trailing white spaces, fuzzywuzzy closest match
K08. Intro to Deep Learning
Activation Layer, relu, Plot
K09. KerasGradient
Preprocessor, Transformer, Added loss and optimizer, Plot
K10. KerasUnderfitOverfit
Do a "Grouped" split to keep all of an artist's songs in one split or the other - prevents signal leakage. Simple Network - linear model underfit. Added three hidden layers - overfit. Added early stopping callback.
K11. BinaryClassification
In regression, MAE is the distance between the expected outcome and the predicted outcome. In classification, cross-entropy is the distance between probability distributions. A sigmoid activation converts the real-valued outputs produced by a dense layer into probabilities.
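A minimal Keras sketch of a sigmoid output trained with binary cross-entropy, on toy data:

```python
# Minimal sketch: sigmoid output trained with binary cross-entropy, on toy data
import numpy as np
from tensorflow import keras

X = np.random.rand(200, 8)
y = np.random.randint(0, 2, size=200)   # binary labels

model = keras.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # squashes output into (0, 1): a probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
model.fit(X, y, epochs=5, verbose=0)
```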
K12. IntroToSQL
BigQuery, Stackoverflow, posts_questions INNER JOIN posts_answers
K13. AdvancedSQL
BigQuery UNION, Analytic Functions, Nested and Repeated Data, Efficient Queries
M000. Notes
M001. Probability
Gaussian Distribution, Bayes, cross entropy H(P, Q), Naive classifier, Log Loss, Brier score
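A quick numpy sketch of log loss (average cross-entropy H(P, Q)) and the Brier score, on hand-picked probabilities:

```python
# Quick numpy sketch of the two metrics on hand-picked probabilities
import numpy as np

y_true = np.array([1, 0, 1, 1])           # actual classes
y_prob = np.array([0.9, 0.2, 0.7, 0.6])   # predicted P(class = 1)

# log loss = average cross-entropy H(P, Q) between labels and predictions
log_loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Brier score = mean squared difference between probabilities and labels
brier = np.mean((y_prob - y_true) ** 2)

print(round(log_loss, 4), round(brier, 4))
```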
M002. Statistics
Gaussian Distribution and Descriptive Stats, Pearson's correlation,
Statistical Hypothesis Tests, Nonparametric Statistics
M003. Linear Algebra
Vectors, Multiply vectors, Matrix, Transpose Matrix, Invert Matrix,
Matrix decomposition, Singular-value decomposition, Eigen decomposition
M004. Optimization
Basin Hopping Optimization, Multimodal Optimization With Multiple Global Optima, Gradient Descent, Gradient Descent Graph, Grid Search
M005. Python Machine Learning
Classification and regression trees, DecisionTreeClassifier, line plot, bar chart, histogram, box and whisker plot, scatter plot
M006. Python Project Iris
Box and whisker plots, Histograms, Scatter plot matrix, Split-out validation dataset, Spot Check Algorithms, Make predictions and evaluate them
M007. Machine Learning Mini Course
Pima Indians diabetes, Scatter Plot Matrix, Standardize data (0 mean, 1 stdev), Cross Validation - Evaluate and LogLoss, KNN Regression, Grid Search for Algorithm Tuning, Random Forest Classification, Save Model Using Pickle
M008. Time Series Forecasting
Data Visualization, Persistence Forecast Model, Autoregressive Forecast Model, ARIMA Forecast Model
M009. Time Series End To End
Test Harness, Persistence, Data Analysis, ARIMA Models, Model Validation, Make Prediction, Validate Model
M010. Time Series End To End Joker
Test Harness, Persistence, Data Analysis, ARIMA Models, Model Validation, Make Prediction, Validate Model
M011. Data Preparation
Fill Missing Values With Imputation, Select Features With RFE, Scale Data With Normalization, Transform Categories With One-Hot Encoding, Transform Numbers to Categories With kBins, Dimensionality Reduction With PCA
M012. Gradient Boosting
Monitor Performance and Early Stopping, Feature Importance with XGBoost, XGBoost Hyperparameter Tuning