fillna() fills in missing values (dropna() drops such rows), ensuring models don't fail due to empty values.
train_test_split() divides data into training and testing sets for evaluation.
np.array() creates arrays for calculations.
pd.DataFrame() creates structured datasets.
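A minimal sketch putting these helpers together, assuming pandas, NumPy, and scikit-learn are installed; the toy "score" column is invented for illustration.
<!-- Example: fillna, np.array, pd.DataFrame, train_test_split -->
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"score": [70, np.nan, 90, 80]})    # pd.DataFrame(): structured dataset
df["score"] = df["score"].fillna(df["score"].mean())  # fillna(): replace the missing value with the mean
X = np.array(df[["score"]])                           # np.array(): numeric array for calculations
y = [0, 1, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)  # hold out 25% for testing
print("Train rows:", len(X_train), "Test rows:", len(X_test))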
Mean, median, and mode describe the center of data: the mean is the average, the median is the middle value when the data is sorted, and the mode is the most frequent value. Beginners can imagine summarizing a list of test scores with these measures.
<!-- Example: mean, median, mode -->
import statistics
data = [70, 80, 90, 80, 100]
print("Mean:", statistics.mean(data))
print("Median:", statistics.median(data))
print("Mode:", statistics.mode(data))
Variance measures how spread out the data is from the mean. Standard deviation is the square root of the variance and expresses that spread in the same units as the data. Beginners can imagine how much scores differ from the average.
<!-- Example: variance & standard deviation -->
print("Variance:", statistics.variance(data))
print("Standard Deviation:", statistics.stdev(data))
Covariance measures how two variables vary together; correlation rescales covariance to a value between -1 and 1, showing how strongly the variables move together (positive or negative). Beginners can think of height and weight: when one increases, the other often increases.
<!-- Example: correlation & covariance -->
import numpy as np
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print("Covariance:", np.cov(x, y)[0, 1])
print("Correlation:", np.corrcoef(x, y)[0, 1])
Conditional probability is the chance of an event happening given that another event has occurred. Bayes' theorem updates probabilities with new information: P(A|B) = P(B|A) * P(A) / P(B). Beginners can imagine guessing the probability of rain given cloud cover.
<!-- Example: conditional probability -->
P_A = 0.3
P_B_given_A = 0.8
P_B = 0.5
P_A_given_B = (P_B_given_A * P_A) / P_B
print("P(A|B):", P_A_given_B)
Distributions describe patterns in data. The normal distribution is bell-shaped, the uniform distribution gives every outcome equal probability, and the binomial distribution counts successes in yes/no trials. Beginners can imagine rolling a die (uniform) or exam scores (roughly normal).
<!-- Example: distributions -->
import numpy as np
normal = np.random.normal(0, 1, 5)
uniform = np.random.uniform(0, 1, 5)
binomial = np.random.binomial(1, 0.5, 5)
print("Normal:", normal)
print("Uniform:", uniform)
print("Binomial:", binomial)
Hypothesis testing checks whether data supports a claim. The p-value is the probability of seeing results at least as extreme as the observed data if the null hypothesis were true. Beginners can imagine testing whether a coin is fair.
<!-- Example: simple hypothesis test -->
from scipy import stats
data = [1, 2, 3, 4, 5]
t_stat, p_val = stats.ttest_1samp(data, 3)
print("p-value:", p_val)
A z-test compares means when samples are large (or the population variance is known); a t-test is used for small samples with unknown variance. Beginners can imagine checking whether students from two classes have different average scores.
<!-- Example: t-test -->
group1 = [70, 75, 80]
group2 = [85, 90, 88]
t_stat, p_val = stats.ttest_ind(group1, group2)
print("t-test p-value:", p_val)
Descriptive statistics summarize the data you have (mean, median), while inferential statistics use a sample to make predictions or generalizations about a population. Beginners can imagine calculating the average exam score of one class versus predicting all students' grades.
<!-- Example: descriptive vs inferential -->
mean_score = statistics.mean(data)
print("Descriptive mean:", mean_score)
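For the inferential side, here is a minimal sketch assuming SciPy is installed: a 95% confidence interval estimates the population mean from a small invented sample.
<!-- Example: inferential statistics (confidence interval) -->
from scipy import stats
import statistics
sample = [70, 80, 90, 80, 100]
mean = statistics.mean(sample)
sem = stats.sem(sample)  # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print("95% confidence interval for the population mean:", ci)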
P(A) is the probability of event A. P(A ∩ B) is the probability that both A and B happen. Beginners can imagine rolling a die: P(roll a 4), or P(roll is even and greater than 2).
<!-- Example: probability basics -->
P_A = 1/6  # chance of rolling a 4
P_even = 3/6
P_even_and_gt2 = 2/6  # rolls of 4 or 6
print("P(A):", P_A, "P(A ∩ B):", P_even_and_gt2)
Random variables assign numbers to outcomes. The expected value is the average outcome over many trials. Beginners can imagine the expected value of a die roll after many throws.
<!-- Example: expected value -->
values = [1, 2, 3, 4, 5, 6]
prob = 1/6
expected_value = sum(v * prob for v in values)
print("Expected value:", expected_value)
<!-- Example: Linear Regression -->
from sklearn.linear_model import LinearRegression
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]
model = LinearRegression()
model.fit(X, y)
print("Prediction for 5:", model.predict([[5]]))
<!-- Example: Logistic Regression -->
from sklearn.linear_model import LogisticRegression
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = LogisticRegression()
model.fit(X, y)
print("Prediction for 1:", model.predict([[1]]))
<!-- Example: Mean Squared Error -->
from sklearn.metrics import mean_squared_error
y_true = [2, 4, 6]
y_pred = [2.1, 3.9, 6.2]
print("MSE:", mean_squared_error(y_true, y_pred))
<!-- Example: Accuracy -->
from sklearn.metrics import accuracy_score
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))
<!-- Example: Decision Tree -->
from sklearn.tree import DecisionTreeClassifier
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = DecisionTreeClassifier()
model.fit(X, y)
print("Prediction for 2:", model.predict([[2]]))
<!-- Example: Random Forest -->
from sklearn.ensemble import RandomForestClassifier
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = RandomForestClassifier(n_estimators=5)
model.fit(X, y)
print("Prediction for 3:", model.predict([[3]]))
<!-- Example: KNN -->
from sklearn.neighbors import KNeighborsClassifier
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = KNeighborsClassifier(n_neighbors=2)
model.fit(X, y)
print("Prediction for 2:", model.predict([[2]]))
<!-- Example: SVM -->
from sklearn.svm import SVC
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = SVC()
model.fit(X, y)
print("Prediction for 1.5:", model.predict([[1.5]]))
<!-- Example: Naive Bayes -->
from sklearn.naive_bayes import GaussianNB
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = GaussianNB()
model.fit(X, y)
print("Prediction for 2:", model.predict([[2]]))
<!-- Example: Ridge (L2) Regularization -->
from sklearn.linear_model import Ridge
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]
model = Ridge(alpha=1.0)
model.fit(X, y)
print("Prediction for 5:", model.predict([[5]]))
A perceptron is the simplest type of neural network. Beginners can imagine it like a tiny decision-maker that takes inputs, multiplies by weights, adds a bias, and outputs a decision (0 or 1).
<!-- Example: simple perceptron logic -->
def perceptron(x1, x2):
    weight1, weight2 = 0.5, 0.5
    bias = -0.5
    total = x1 * weight1 + x2 * weight2 + bias
    return 1 if total > 0 else 0
print(perceptron(1, 1))
Activation functions decide the output of a neuron. Sigmoid squashes values into 0-1, ReLU outputs the input if it is positive and 0 otherwise, tanh outputs values between -1 and 1, and softmax turns scores into probabilities. Beginners can imagine these as "decision rules".
<!-- Example: activation functions -->
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def relu(x):
    return max(0, x)
def tanh(x):
    return np.tanh(x)
print("Sigmoid(0):", sigmoid(0))
print("ReLU(-2):", relu(-2))
print("Tanh(0):", tanh(0))
Forward propagation passes inputs through neurons to produce output. Backward propagation adjusts weights using errors to improve predictions. Beginners can imagine testing an answer and learning from mistakes.
<!-- Example: concept illustration -->
# Forward: input * weight plus bias
x = 2; w = 0.5; b = 0.1
forward = x * w + b
print("Forward output:", forward)
# Backward: adjust the weight using the error
error = 1 - forward
w = w + 0.1 * error
print("Updated weight:", w)
Loss functions measure how wrong predictions are. MSE is used for numeric (regression) targets; cross-entropy is used for class (classification) targets. Beginners can imagine measuring the distance from the correct answer.
<!-- Example: MSE loss -->
y_true = [2, 3]
y_pred = [2.5, 2.8]
mse = sum((y_true[i] - y_pred[i]) ** 2 for i in range(2)) / 2
print("MSE:", mse)
Learning rate controls how fast weights are updated. Too high → overshoot, too low → slow learning. Beginners can imagine adjusting step size while walking to a target.
<!-- Example: simple learning rate update -->
weight = 0.5
error = 0.2
lr = 0.1
weight = weight - lr * error
print("Updated weight:", weight)
Regularization prevents overfitting. Dropout randomly ignores neurons during training. Weight decay reduces large weights. Beginners can imagine limiting reliance on any single neuron.
<!-- Example: dropout concept -->
import random
neurons = [1, 2, 3, 4]
active_neurons = [n for n in neurons if random.random() > 0.5]
print("Active neurons:", active_neurons)
CNNs handle images by detecting patterns in pixels. RNNs/LSTMs handle sequences like text or time series. Beginners can imagine CNNs as "pattern spotters" and RNNs as "remembering order".
<!-- Example: conceptual code -->
import numpy as np
# CNN: image input, 28x28 pixels
image = np.random.rand(28, 28)
# RNN: sequence input
sequence = [1, 2, 3, 4]
print("Image shape:", image.shape, "Sequence:", sequence)
Transfer learning uses a model trained on one task for a new task. Beginners can imagine learning a new language using knowledge of another language. It saves training time.
<!-- Example: concept -->
# Pretend we have a pre-trained model
pretrained_model = "trained on cat images"
new_task = "dog images"
print("Using", pretrained_model, "for", new_task)
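To make this concrete, here is a minimal sketch assuming TensorFlow/Keras is available; MobileNetV2 and the binary "dog vs. not-dog" head are illustrative choices, not the only option.
<!-- Example: transfer learning with a pre-trained network -->
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras import layers, models

base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(96, 96, 3))
base.trainable = False  # freeze the pre-trained layers
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # new output layer for the new task
])
model.compile(optimizer="adam", loss="binary_crossentropy")
print("Transfer learning model ready; only the new head will be trained")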
GANs have two networks: a generator creates fake data and a discriminator tries to tell real from fake. Beginners can imagine a counterfeiter and a police officer checking for fake bills. GANs can produce realistic images.
<!-- Example: GAN concept -->
generator_output = "fake image"
discriminator_check = "real or fake?"
print("Generator output:", generator_output, "Discriminator:", discriminator_check)
Hyperparameters are settings such as the learning rate or the number of neurons. Adjusting them improves model performance. Beginners can imagine changing the oven temperature to bake a better cake.
<!-- Example: tuning learning rate -->
learning_rate = 0.1
print("Try smaller lr:", learning_rate / 10)
print("Try larger lr:", learning_rate * 2)
<!-- Example: basic RL concept -->
state = 0
action = 'move_right'
reward = 1
print("State:", state, "Action:", action, "Reward:", reward)
<!-- Example: Q-table concept -->
Q = {}
state = 0
action = 'right'
Q[(state, action)] = 0
Q[(state, action)] += 1  # update
print("Q-value:", Q)
<!-- Example: policy gradient idea -->
action_prob = {'left': 0.5, 'right': 0.5}
action_prob['right'] += 0.1  # increase probability of the better action
print("Updated action probabilities:", action_prob)
<!-- Example: actor-critic concept -->
actor = 'suggest action'
critic = 'evaluate action'
print(actor, "->", critic)
<!-- Example: simple choice -->
import random
action = random.choice(['explore', 'exploit'])
print("Agent action:", action)
<!-- Example: RL application concept -->
application = "robot learns to pick objects"
print("RL application example:", application)
<!-- Example: OpenAI Gym concept -->
import gym
env = gym.make('CartPole-v1')
state = env.reset()
print("Initial state:", state)
<!-- Example: reward shaping -->
reward = 0
reward += 1  # small step reward
print("Shaped reward:", reward)
<!-- Example: value function -->
V = {}
state = 0
V[state] = 10  # estimated value
print("Value of state:", V[state])
<!-- Example: deep RL concept -->
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([Dense(10, input_shape=(4,), activation='relu'), Dense(2)])
print("Deep RL neural network created")
Splitting data into training and testing ensures the model learns from one set and is evaluated on another. Cross-validation repeats this process in multiple ways to get more reliable results. Beginners can imagine checking homework answers with different sample questions.
<!-- Example: train/test split -->
from sklearn.model_selection import train_test_split
X = [[1], [2], [3], [4]]
y = [0, 1, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print("Train X:", X_train, "Test X:", X_test)
A confusion matrix shows prediction results: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Beginners can imagine checking correct vs wrong predictions in a simple table.
<!-- Example: confusion matrix -->
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)
The ROC curve shows the trade-off between the true positive rate and the false positive rate. AUC measures the area under that curve. Beginners can imagine plotting sensitivity against the false alarm rate to check classifier quality.
<!-- Example: ROC concept -->
from sklearn.metrics import roc_auc_score
y_true = [0, 1, 1, 0]
y_score = [0.1, 0.9, 0.4, 0.3]
auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)
Hyperparameters are settings like tree depth or learning rate. GridSearchCV tries all combinations, while randomized search (RandomizedSearchCV) tries random combinations to find the best. Beginners can imagine testing different oven temperatures to bake the perfect cake.
<!-- Example: GridSearchCV concept -->
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
params = {'C': [0.1, 1, 10]}
model = GridSearchCV(LogisticRegression(), param_grid=params)
print("GridSearch ready for tuning")
Bias is error from a model that is too simple; variance is error from a model that is too complex and too sensitive to the training data. Beginners can imagine underfitting versus overfitting a line to points. Visualization helps you see the balance.
<!-- Example: concept illustration -->
bias_error = 0.2
variance_error = 0.3
total_error = bias_error + variance_error
print("Total Error:", total_error)
Learning curves show how training and validation errors change as model learns. Beginners can imagine tracking scores while practicing exercises to see improvement.
<!-- Example: learning curve concept -->
train_error = [0.5, 0.3, 0.2]
val_error = [0.6, 0.35, 0.25]
print("Training errors:", train_error)
print("Validation errors:", val_error)
Saving a trained model lets you reuse it later without retraining. Beginners can imagine storing a solved homework to check answers later.
<!-- Example: save and load model -->
import joblib
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()  # assume the model is trained
joblib.dump(model, "model.pkl")
loaded_model = joblib.load("model.pkl")
print("Model loaded successfully")
Deployment allows your model to serve predictions via a web interface or API. Beginners can imagine creating a small webpage where users input data and get predictions.
<!-- Example: Flask concept -->
from flask import Flask
app = Flask(__name__)

@app.route("/")
def home():
    return "Model is ready!"

print("Flask app setup done")
Monitoring tracks model performance in real-time, detecting issues like drift or errors. Beginners can imagine checking a weather app’s predictions daily to see accuracy.
<!-- Example: monitoring concept -->
predictions = [0, 1, 1, 0]
actuals = [0, 1, 0, 0]
accuracy = sum(pred == act for pred, act in zip(predictions, actuals)) / len(actuals)
print("Real-time accuracy:", accuracy)
Edge deployment runs models on devices like phones or Raspberry Pi. Beginners can imagine having a small AI that works offline without needing the internet.
<!-- Example: concept only -->
model_format = "TensorFlow Lite"
device = "Raspberry Pi"
print("Deploying", model_format, "model to", device)
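A minimal sketch of the edge workflow, assuming TensorFlow is installed; the tiny Dense model is a stand-in for whatever model you actually trained.
<!-- Example: convert a model to TensorFlow Lite -->
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([layers.Dense(4, activation="relu", input_shape=(2,)), layers.Dense(1)])
converter = tf.lite.TFLiteConverter.from_keras_model(model)  # convert the Keras model to TFLite
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)  # this small file can be copied to a phone or Raspberry Pi
print("TFLite model saved for edge deployment")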
> from sklearn.metrics import mean_squared_error, mean_absolute_error
> y_true = [3, 5, 7]
> y_pred = [2.5, 5, 6.8]
> print("MSE:", mean_squared_error(y_true, y_pred))
> print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)  # RMSE is the square root of MSE
> print("MAE:", mean_absolute_error(y_true, y_pred))
> from sklearn.metrics import accuracy_score, precision_score, recall_score
> y_true = [0,1,1,0]
> y_pred = [0,1,0,0]
> print("Accuracy:", accuracy_score(y_true, y_pred))
> print("Precision:", precision_score(y_true, y_pred))
> print("Recall:", recall_score(y_true, y_pred))
> from sklearn.metrics import f1_score
> print("F1 score:", f1_score(y_true, y_pred))
> from sklearn.metrics import roc_auc_score
> y_true = [0,1,1,0]
> y_prob = [0.2,0.8,0.4,0.1]
> print("AUC:", roc_auc_score(y_true, y_prob))
> from sklearn.metrics import log_loss
> y_true = [0,1,1,0]
> y_prob = [0.1,0.9,0.8,0.2]
> print("Log Loss:", log_loss(y_true, y_prob))
> from sklearn.model_selection import cross_val_score
> from sklearn.linear_model import LogisticRegression
> X = [[1],[2],[3],[4]]
> y = [0,0,1,1]
> model = LogisticRegression()
> print("CV Scores:", cross_val_score(model, X, y, cv=2))
> from sklearn.tree import DecisionTreeClassifier
> X = [[1],[2],[3],[4]]
> y = [0,0,1,1]
> model = DecisionTreeClassifier(max_depth=1)  # shallow tree = underfit
> model.fit(X, y)
> print("Predictions:", model.predict([[2.5]]))
> # Shallow tree = high bias, deep tree = high variance
> model_shallow = DecisionTreeClassifier(max_depth=1)
> model_deep = DecisionTreeClassifier(max_depth=10)
> model_shallow.fit(X, y); model_deep.fit(X, y)
> print("Shallow:", model_shallow.predict([[2.5]]))
> print("Deep:", model_deep.predict([[2.5]]))
> from sklearn.linear_model import LogisticRegression
> from sklearn.tree import DecisionTreeClassifier
> model1 = LogisticRegression(); model2 = DecisionTreeClassifier()
> model1.fit(X, y); model2.fit(X, y)
> print("LogReg:", model1.predict([[2.5]]))
> print("Tree:", model2.predict([[2.5]]))
> from sklearn.model_selection import train_test_split
> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y)  # stratify keeps both classes in the training split
> model = LogisticRegression()
> model.fit(X_train, y_train)
> y_pred = model.predict(X_test)
> print("Test Predictions:", y_pred)
> from sklearn.ensemble import BaggingClassifier
> from sklearn.tree import DecisionTreeClassifier
> X = [[0],[1],[2],[3]]; y = [0,0,1,1]
> model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=3)  # newer scikit-learn uses 'estimator' instead of 'base_estimator'
> model.fit(X, y)
> print("Bagging prediction:", model.predict([[1.5]]))
> from sklearn.ensemble import RandomForestClassifier
> model = RandomForestClassifier(n_estimators=5)
> model.fit(X, y)
> print("Random Forest prediction:", model.predict([[2]]))
> # Boosting concept demonstration
> # Each new model focuses on previous mistakes (example shown with AdaBoost below)
> from sklearn.ensemble import AdaBoostClassifier
> model = AdaBoostClassifier(n_estimators=5)
> model.fit(X, y)
> print("AdaBoost prediction:", model.predict([[1]]))
> from sklearn.ensemble import GradientBoostingClassifier
> model = GradientBoostingClassifier(n_estimators=5)
> model.fit(X, y)
> print("Gradient Boosting prediction:", model.predict([[0.5]]))
> import xgboost as xgb
> model = xgb.XGBClassifier(n_estimators=5, eval_metric='logloss')  # 'use_label_encoder' is no longer needed in recent xgboost
> model.fit(X, y)
> print("XGBoost prediction:", model.predict([[2]]))
> import lightgbm as lgb
> model = lgb.LGBMClassifier(n_estimators=5)
> model.fit(X, y)
> print("LightGBM prediction:", model.predict([[1.5]]))
> from sklearn.ensemble import StackingClassifier
> from sklearn.linear_model import LogisticRegression
> estimators = [('rf', RandomForestClassifier(n_estimators=3)), ('dt', DecisionTreeClassifier())]
> model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=2)  # small cv because the toy dataset has only 4 samples
> model.fit(X, y)
> print("Stacking prediction:", model.predict([[1]]))
> from sklearn.ensemble import VotingClassifier
> model1 = LogisticRegression(); model2 = DecisionTreeClassifier()
> voting_model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
> voting_model.fit(X, y)
> print("Voting prediction:", voting_model.predict([[2]]))
> # Example idea: Combine Random Forest + Gradient Boosting to detect fraud
> # Train both models, average predictions for final output
> rf_pred = RandomForestClassifier(n_estimators=5).fit(X, y).predict([[1]])
> gb_pred = GradientBoostingClassifier(n_estimators=5).fit(X, y).predict([[1]])
> print("Ensemble average prediction:", (rf_pred[0] + gb_pred[0]) / 2)
Bagging trains multiple models on random subsets of data and averages their predictions to reduce errors. Beginners can imagine asking many friends for a guess and averaging results.
<!-- Bagging example -->
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=5)
print("Bagging model ready")
Random Forest uses bagging but also picks random features for each tree. Beginners can imagine a forest of trees, each giving different opinions and voting.
<!-- Random Forest example -->
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=5)
print("Random Forest model ready")
Boosting trains models one after another, focusing on previous errors. Beginners can imagine improving guesses step by step based on mistakes.
<!-- Conceptual boosting -->
# Imagine 3 weak models trained one after another, each shrinking the remaining error
errors = [0.3, 0.2, 0.1]
print("Error after each boosting round:", errors)
print("Final error:", errors[-1])
GBM is a type of boosting that uses gradient descent to minimize errors. Beginners can imagine walking down a slope in small steps to reach the lowest error.
<!-- GBM example -->
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=5)
print("GBM model ready")
XGBoost is a faster, optimized implementation of gradient boosting. Beginners can imagine a faster car descending the error slope efficiently.
<!-- XGBoost example -->
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=5)
print("XGBoost model ready")
LightGBM handles big datasets quickly by using histogram-based techniques. Beginners can imagine summarizing data quickly to make predictions faster.
<!-- LightGBM example -->
import lightgbm as lgb
model = lgb.LGBMClassifier(n_estimators=5)
print("LightGBM model ready")
CatBoost is designed to handle categorical variables efficiently. Beginners can imagine a tool that understands text labels without converting them manually.
<!-- CatBoost example -->
from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=5, verbose=0)
print("CatBoost model ready")
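A minimal sketch of that categorical handling, assuming the catboost package is installed; the colour column and labels are invented.
<!-- CatBoost with a categorical feature -->
from catboost import CatBoostClassifier
X = [["red", 1], ["blue", 2], ["red", 3], ["blue", 4]]
y = [0, 0, 1, 1]
model = CatBoostClassifier(iterations=5, verbose=0)
model.fit(X, y, cat_features=[0])  # column 0 holds text categories, used without manual encoding
print("Prediction:", model.predict([["red", 2]]))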
Stacking uses predictions from several models as input to a final model. Beginners can imagine multiple friends guessing and a leader combining their guesses for final answer.
<!-- Concept: stacking -->
preds_model1 = [0, 1]
preds_model2 = [1, 1]
final_input = list(zip(preds_model1, preds_model2))
print("Stacking input:", final_input)
Voting classifiers combine predictions from multiple models. Majority vote chooses the most common prediction. Weighted vote gives more importance to better models. Beginners can imagine a group decision.
<!-- Voting concept -->
votes = [0, 1, 1]
final_vote = 1 if votes.count(1) > votes.count(0) else 0
print("Final vote:", final_vote)
Cross-validation checks ensemble performance on different data splits. Beginners can imagine testing your solution multiple times to be confident it works.
<!-- Cross-validation concept -->
from sklearn.model_selection import cross_val_score
# model assumed defined
print("Cross-validation ready")
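A runnable sketch, assuming scikit-learn; the six toy samples and the random forest are placeholders for your own data and ensemble.
<!-- Cross-validation of an ensemble -->
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]
model = RandomForestClassifier(n_estimators=5)
scores = cross_val_score(model, X, y, cv=3)  # evaluate on 3 different train/test splits
print("Fold scores:", scores, "Mean:", scores.mean())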
Stationarity means a series' statistical properties (mean, variance) don't change over time. The Augmented Dickey-Fuller (ADF) test checks this. Beginners can imagine checking whether temperature patterns repeat consistently.
<!-- ADF test example -->
from statsmodels.tsa.stattools import adfuller
data = [3, 4, 4, 5, 6, 5, 7, 8, 7, 9, 10, 9]  # a short upward-trending series
result = adfuller(data)
print("ADF p-value:", result[1])
Differencing subtracts previous value from current to remove trend. Beginners can imagine looking at daily temperature changes instead of absolute temperatures.
<!-- Differencing example -->
# uses `data` from the ADF example above
diff = [data[i] - data[i-1] for i in range(1, len(data))]
print("Differenced data:", diff)
Autocorrelation (ACF) and partial autocorrelation (PACF) plots show how data points relate to their past values. Beginners can imagine checking whether yesterday's stock price affects today's price.
<!-- Concept example -->
# Conceptual example
print("Autocorrelation check: data compared to past values")
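A small numeric sketch assuming statsmodels is installed; the toy series is invented, and in practice you would look at the plot_acf/plot_pacf charts.
<!-- ACF and PACF values -->
from statsmodels.tsa.stattools import acf, pacf
series = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4]
print("ACF (first lags):", acf(series, nlags=3))
print("PACF (first lags):", pacf(series, nlags=3))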
AR (AutoRegressive) uses past values, MA (Moving Average) uses past errors, ARMA combines both. Beginners can imagine predicting today’s value using past info.
<!-- ARMA concept -->
print("ARMA model predicts next value using past values and errors")
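A minimal runnable sketch, assuming statsmodels; with differencing order 0, statsmodels' ARIMA class fits a plain ARMA model, and the short series is invented.
<!-- Example: ARMA via statsmodels -->
from statsmodels.tsa.arima.model import ARIMA
series = [2, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8]
arma = ARIMA(series, order=(1, 0, 1))  # AR order 1, no differencing, MA order 1
result = arma.fit()
print("Next value forecast:", result.forecast(1))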
ARIMA models include differencing for trends (Integrated). Beginners can imagine adjusting for trends before predicting future values.
<!-- ARIMA concept -->
print("ARIMA model ready for trend-adjusted forecast")
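The same idea with differencing included; a minimal sketch assuming statsmodels, on an invented trending series.
<!-- Example: ARIMA forecast -->
from statsmodels.tsa.arima.model import ARIMA
trend_series = [10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26]
model = ARIMA(trend_series, order=(1, 1, 1))  # the middle 1 differences the series to remove the trend
result = model.fit()
print("Next two values:", result.forecast(2))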
SARIMA handles repeating patterns like seasons. Beginners can imagine predicting winter sales using last winter’s data.
<!-- SARIMA concept -->
print("SARIMA model handles seasonal patterns")
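A minimal seasonal sketch assuming statsmodels; the pattern repeating every 4 steps and the numbers are invented.
<!-- Example: SARIMA (SARIMAX) forecast -->
from statsmodels.tsa.statespace.sarimax import SARIMAX
sales = [10, 20, 15, 30, 12, 22, 17, 32, 14, 24, 19, 34]  # repeating pattern every 4 steps
model = SARIMAX(sales, order=(1, 0, 0), seasonal_order=(1, 0, 0, 4))
result = model.fit(disp=False)
print("Next season forecast:", result.forecast(4))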
Prophet helps forecast time series easily. Beginners can imagine quickly predicting future sales using simple Python commands.
<!-- Prophet example -->
from prophet import Prophet
print("Prophet library ready for forecasting")
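A minimal sketch assuming the prophet package (and pandas) is installed; Prophet expects a DataFrame with "ds" (dates) and "y" (values) columns, and the toy daily values are invented.
<!-- Example: Prophet forecast -->
from prophet import Prophet
import pandas as pd
df = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=30, freq="D"),
    "y": [i + (i % 7) for i in range(30)],  # upward trend with a weekly wiggle
})
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=7)
forecast = m.predict(future)
print(forecast[["ds", "yhat"]].tail(7))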
LSTM (Long Short-Term Memory) is a type of RNN that is good at learning from sequences. Beginners can imagine remembering past days' values to predict future ones.
<!-- LSTM concept -->
sequence = [1, 2, 3, 4]
print("LSTM input sequence:", sequence)
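A small runnable sketch assuming TensorFlow/Keras; the two toy windows of three past values each are invented, and real use needs far more data and training.
<!-- Example: tiny LSTM on a sequence -->
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

X = np.array([[[1], [2], [3]], [[2], [3], [4]]])  # shape: (samples, timesteps, features)
y = np.array([4, 5])                              # the value that follows each window
model = Sequential([LSTM(8, input_shape=(3, 1)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, verbose=0)
print("Prediction after [3,4,5]:", model.predict(np.array([[[3], [4], [5]]]), verbose=0))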
Sliding window creates features from past observations. Beginners can imagine using past 3 days temperatures to predict today.
<!-- Sliding window example -->
window_size = 3
features = [sequence[i:i+window_size] for i in range(len(sequence) - window_size)]
print("Sliding window features:", features)
Metrics like RMSE, MAE, MAPE measure prediction errors. Beginners can imagine checking how far predictions are from real values.
<!-- Evaluation metrics example -->
import numpy as np
y_true = [3, 5, 2]
y_pred = [2.5, 4.8, 2.1]
rmse = np.sqrt(np.mean([(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]))
mae = np.mean([abs(yt - yp) for yt, yp in zip(y_true, y_pred)])
print("RMSE:", rmse, "MAE:", mae)