This response will help you prepare for your midterm by detailing the topics covered in the course so far, aligning with the approximate breakdown provided for your exam.
### Midterm Exam Preparation Guide
The midterm exam will be 90 minutes long and consist of approximately 60 MCQ/TF/Short answer questions. All topics covered in the course thus far are fair game.
---
#### 1. Basic Python Questions (~7 questions)
This section will test your foundational understanding of Python.
- **Introduction to Python**: The course includes an introduction to Python to help you get comfortable with its usage.
- **Python Basics**: Key concepts include immutable and mutable types, functions and lambda expressions, and lists.
- **Numpy**: A library for numerical operations in Python, crucial for data analysis.
- **Python Dictionaries**: These are used to store data values in key:value pairs. You should know how to create, get, update, add, and remove items from dictionaries. For example, to get an item, you use `thisdict["model"]`, and to add, `thisdict["color"] = "red"`.
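A minimal sketch of those dictionary operations, extending the `thisdict` example above (the `brand` and `year` keys are illustrative):

```python
# Create a dictionary of key:value pairs
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

print(thisdict["model"])    # get an item -> "Mustang"
thisdict["year"] = 2020     # update an existing item
thisdict["color"] = "red"   # add a new item
thisdict.pop("brand")       # remove an item by key
print(thisdict)             # {'model': 'Mustang', 'year': 2020, 'color': 'red'}
```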
#### 2. Numpy and Pandas (~15 questions)
This section focuses on data manipulation and analysis using the Pandas library, which often builds on Numpy.
- **Pandas Series**: Pandas Series can be created from Python dictionaries. If an explicit index is passed and its values match the dictionary keys, the index has no effect on the resulting data.
- **Reading Data**: Pandas provides commands to read various data formats, including CSV (`pd.read_csv`), Excel (`pd.read_excel`), Stata (`pd.read_stata`), SAS (`pd.read_sas`), and HDF (`pd.read_hdf`). These commands often have optional arguments to fine-tune the import process.
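For example (the file name here is hypothetical), optional arguments such as `index_col` and `na_values` fine-tune the import:

```python
import pandas as pd

# Read a CSV, using the first column as the index and treating blanks/"NA" as missing
df = pd.read_csv("salaries.csv", index_col=0, na_values=["", "NA"])

# The other readers follow the same pattern, e.g.:
# df = pd.read_excel("salaries.xlsx", sheet_name=0)
```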
- **Exploring DataFrames**:
- `df.head()`: Lists the first 'n' records (defaulting to 5).
- `df.tail()`: Lists the last 'n' rows.
- **DataFrame Data Types**: Pandas assigns specific data types to columns based on their content:
- `object`: The most general type, assigned if a column has mixed types (e.g., numbers and strings).
- `int64`: For integer values; the 64 refers to the 64 bits of memory allocated.
- `float64`: For numbers with decimals. If a column contains numbers and NaNs, Pandas defaults to `float64`.
- `datetime64`, `timedelta64[ns]`: For time data, useful for time series experiments.
- You can check a particular column's type using `df['column_name'].dtype` or all columns' types using `df.dtypes`.
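A small illustration of how these types are assigned and checked (toy data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Ann", "Bob"],        # strings -> object
    "age": [30, 25],               # integers -> int64
    "salary": [120000.0, np.nan],  # numbers + NaN -> float64
})

print(df["age"].dtype)  # type of one column: int64
print(df.dtypes)        # types of all columns
```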
- **DataFrame Attributes**: Python objects like DataFrames have attributes:
- `dtypes`: Lists column types.
- `columns`: Lists column names.
- `axes`: Lists row labels and column names.
- `ndim`: Number of dimensions.
- `size`: Number of elements.
- `shape`: A tuple representing dimensionality.
- `values`: Numpy representation of the data.
- **DataFrame Methods**: Unlike attributes, methods have parentheses.
- `describe()`: Generates descriptive statistics for numeric columns.
- `max()`, `min()`, `mean()`, `median()`, `std()`: Return max/min, mean/median, and standard deviation for numeric columns.
- `sample([n])`: Returns a random sample of the DataFrame.
- `dropna()`: Drops all records with missing values.
- **Selecting Columns**:
- Method 1: Subset the DataFrame using column name, e.g., `df['sex']`.
- Method 2: Use the column name as an attribute, e.g., `df.sex`.
- Note: If a column name conflicts with a Pandas DataFrame attribute (e.g., "rank"), Method 1 must be used.
- Use single brackets for a Series output and double brackets (`df[['rank', 'salary']]`) for a DataFrame output when selecting columns.
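A minimal sketch of the selection rules above, on a made-up salaries DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["F", "M"],
                   "rank": ["Prof", "AsstProf"],
                   "salary": [140000, 80000]})

s1 = df['sex']                # Method 1: always works -> Series
s2 = df.sex                   # Method 2: attribute access -> Series
r = df['rank']                # 'rank' clashes with the DataFrame.rank() method, so Method 1 is required
sub = df[['rank', 'salary']]  # double brackets -> DataFrame with two columns
```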
- **Group Aggregation (Groupby)**:
- Allows splitting data into groups based on criteria.
- You can calculate various statistics for each group, like `df.groupby('rank')[['salary']].mean()`.
- No grouping occurs until needed; passing `sort=False` can speed up the operation.
- **Filtering (Boolean Indexing)**:
- Used to subset data based on conditions, e.g., `df[df['salary'] > 120000]`.
- Boolean operators include `>, >=, <, <=, ==, !=`.
- **Slicing**:
- Selects a range of rows using `df[10:20]`. The first row has position 0, and the end of the range is exclusive (row 20 is not included).
- **`loc` and `iloc` Methods**:
- `loc`: Selects rows by their labels, e.g., `df_sub.loc[10:20, ['rank', 'sex', 'salary']]`.
- `iloc`: Selects rows and/or columns by their positions. Examples: `df.iloc[0]` (first row), `df.iloc[:, 0]` (first column), `df.iloc[1:3, 0:2]` (second through third rows and first two columns).
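A small sketch contrasting the two (note that, unlike positional slicing, `loc` label slices include both endpoints):

```python
import pandas as pd

df = pd.DataFrame({"rank": ["Prof", "AsstProf", "AssocProf"],
                   "sex": ["F", "M", "F"],
                   "salary": [140000, 80000, 105000]})

print(df.loc[0:1, ['rank', 'salary']])  # loc: by labels; 0:1 includes BOTH endpoints
print(df.iloc[0])                       # iloc: first row
print(df.iloc[:, 0])                    # first column
print(df.iloc[1:3, 0:2])                # rows 2-3 and first two columns (end exclusive)
```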
- **Sorting**:
- Sort data by column values using `df.sort_values(by='service')`. By default, it's ascending, and a new DataFrame is returned.
- Can sort by multiple columns with specified ascending/descending order, e.g., `df.sort_values(by=['service', 'salary'], ascending=[True, False])`.
- **Missing Values**:
- Represented as `NaN`.
- Identify rows with missing values using `flights[flights.isnull().any(axis=1)]`.
- Methods to handle: `dropna()` (drops observations with any missing value), `dropna(how='all')` (drops a row only if all cells are NA), `dropna(axis=1, how='all')` (drops a column if all its values are missing), `dropna(thresh=5)` (drops rows with fewer than 5 non-missing values).
- `fillna(0)`: Replaces missing values with zeros.
- `isnull()`, `notnull()`: Return True for missing or non-missing values respectively.
- Missing values are treated as zero when summing data, result in `NaN` if all values are missing, and are excluded in `GroupBy` methods. Most descriptive statistics methods have a `skipna` option (True by default).
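A minimal sketch of these missing-value operations on a toy DataFrame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

print(df[df.isnull().any(axis=1)])  # rows that contain at least one NaN
print(df.dropna())                  # drop rows with any missing value
print(df.dropna(how='all'))         # drop rows only if ALL cells are NaN
print(df.fillna(0))                 # replace NaN with zeros
print(df['a'].sum())                # 4.0 -- NaN skipped (treated as zero) by default
```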
- **Aggregation Functions**: Compute summary statistics (e.g., sums, means, counts) per group. Common functions: `min`, `max`, `count`, `sum`, `prod`, `mean`, `median`, `mode`, `mad`, `std`, `var`. The `agg()` method is useful for computing multiple statistics per column.
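For instance, `agg()` can compute several statistics per column in a single call (toy data below):

```python
import pandas as pd

df = pd.DataFrame({"rank": ["Prof", "Prof", "AsstProf"],
                   "salary": [140000, 120000, 80000]})

# Multiple statistics for the salary column, computed per rank group
print(df.groupby('rank')['salary'].agg(['min', 'max', 'mean']))
```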
- **Basic Descriptive Statistics**: `describe()` provides count, mean, std, min, quantiles, max. Other methods include `min`, `max`, `mean`, `median`, `mode`, `var`, `std`, `sem`, `skew`, `kurt`.
- **Graphics**: To show graphs inline in a Python notebook, use `%matplotlib inline`. Seaborn is a package built on matplotlib for statistical graphics, similar to R's ggplot2. It offers various plot types like `distplot`, `barplot`, `violinplot`, `jointplot`, `regplot`, `pairplot`, `boxplot`, `swarmplot`, `factorplot`.
- **Statistical Analysis**: `statsmodels` is used for classical statistical analysis (linear regressions, ANOVA tests, hypothesis testing) with R-style formulas, while `scikit-learn` is more tailored for machine learning (k-means, support vector machines, random forests).
- **Pandas Merge**: Used to combine DataFrames on common columns or indices, in the style of SQL joins (inner, left, right, outer).
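A minimal merge sketch; the `id`/`name`/`salary` columns are made up:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cat"]})
right = pd.DataFrame({"id": [2, 3, 4], "salary": [80000, 105000, 140000]})

# Inner join on the shared 'id' column; how= can also be 'left', 'right', or 'outer'
merged = pd.merge(left, right, on="id", how="inner")
print(merged)  # only ids 2 and 3 -- the ids present in both frames
```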
#### 3. Data Cleaning and Pre-processing (~15 questions)
This section covers preparing raw data for analysis and machine learning.
- **Goal of Data Preparation**: To ready data for tasks like data mining, categorization, and anomaly detection.
- **Key Topics**: Includes data cleaning (missing values, outliers, errors), transformation (normalization, standardization, log-transformations, scaling), encoding (categorical to numerical), handling date/time, text preprocessing, integration, normalization, noisy data, duplicates, discretization, skewed data, sampling, splitting, and aggregation.
- **Data Encoding (Quantization)**: Converting values (e.g., strings to integers, dates to days) so they can be used by mathematical machine learning algorithms.
- **Approaches**: Mapping every string to a number, or One-hot encoding.
- **One-hot Encoding**: A technique to represent categorical variables as binary vectors. It transforms each category into a new binary variable (dummy variable). For instance, "Color" with categories "Red," "Green," "Blue" becomes "Color_Red," "Color_Green," and "Color_Blue". This ensures that categorical information is captured without imposing an arbitrary numerical order.
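A minimal sketch of one-hot encoding the color example with `pd.get_dummies` (scikit-learn's `OneHotEncoder` is a common alternative):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red"]})

encoded = pd.get_dummies(df, columns=["Color"])
print(encoded)
# Columns: Color_Blue, Color_Green, Color_Red -- one binary (dummy) variable per category
```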
- **Data Cleaning - Missing Values**:
- Missing values occur when a variable's value is not provided, often represented as `NaN` or `null` in Pandas.
- **How to Handle**:
- `dataframe.dropna()`: Drops rows with missing values, but may lead to significant data loss.
- **Imputation**: Replacing missing values with estimated values.
- **Mean/Median Imputation**: Replaces missing values with the mean or median of the available values in the same column, suitable for numerical data when missing values are assumed to be random.
- **Mode Imputation**: Replaces missing categorical values with the most frequent value (mode) in the same column.
- **K-nearest neighbors (KNN) Imputation**: Replaces missing values with values from the 'k' most similar records based on other variables. Requires prior quantization of values for categorical data.
- **Other approaches**: Regression imputation, Hot-deck imputation, Multiple imputation, and Domain-specific imputation.
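A minimal imputation sketch, using pandas for mean/mode imputation and scikit-learn's `KNNImputer` for the KNN variant (the toy values are made up):

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NY", "LA", None]})

df["age"] = df["age"].fillna(df["age"].mean())        # mean imputation (numerical)
df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode imputation (categorical)

# KNN imputation: each missing value is filled from the k most similar rows
X = [[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]]
print(KNNImputer(n_neighbors=2).fit_transform(X))
```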
- **Outliers**:
- Observations that are noticeably different or inconsistent with the general pattern of the data, potentially extremely high or low values.
- Some machine learning algorithms (e.g., decision trees, random forests, SVM with robust kernels) are inherently robust to outliers.
- **Handling Outliers**: The basic approach is to remove them.
- **Identifying Outliers**:
- **Standard Deviation (STD)**: Measures the average variability in your dataset. Low STD means data is clustered around the mean, high STD means data is spread out. To find outliers, calculate the mean and standard deviation, then set a threshold (typically 2 or 3 standard deviations from the mean) to identify points falling outside this range.
- **Local Outlier Factor (LOF)**: Measures how much a data point stands out from its local neighborhood. Higher LOF values indicate stronger outliers, while values close to 1 suggest similarity to neighbors. LOF is considered better at detecting anomalies than STD, especially when data patterns change over time (e.g., weekdays vs. weekends), as it considers neighborhood density. Outlier detection approaches like LOF can be used for anomaly detection in security applications, such as flagging suspicious logins.
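A sketch of both detection approaches on a tiny made-up sample; the 2-standard-deviation threshold and `n_neighbors=3` are illustrative choices:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

data = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 35.0])  # 35.0 is the planted outlier

# STD method: flag points more than 2 standard deviations from the mean
mean, std = data.mean(), data.std()
print(data[np.abs(data - mean) > 2 * std])  # [35.]

# LOF: fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=3)
print(lof.fit_predict(data.reshape(-1, 1)))
```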
#### 4. Introduction to ML + SVM (~15 questions)
This section covers the fundamentals of Machine Learning and dives into Support Vector Machines.
- **What is Machine Learning (ML)?**: The capability of AI systems to learn by extracting patterns from data, so they improve from examples and experience without being explicitly programmed. It is often summarized as "automating automation": letting the data program the computer.
- **Growth and Applications**: ML is preferred for speech recognition, NLP, computer vision, medical outcomes, robot control, and computational biology. Its growth is fueled by improved algorithms, data capture, faster computers, and the complexity of software. ML offers benefits like powerful processing, better decision-making, quicker processing, and affordability.
- **Steps in an ML Project**:
1. Defining a Problem.
2. Preparing Data (collection of raw data, cleaning, feature selection). Feature selection involves choosing relevant variables for model construction.
3. Evaluating Algorithms.
4. Improving Results.
5. Presenting Results.
- **Support Vector Machines (SVM)**: A machine learning algorithm.
- **Intuition**: SVM aims to find a "good" separator (hyperplane) that maximizes the margin between different classes of data points. Support vectors are the data points closest to the hyperplane that influence its position.
- **Non-Linear Separability**:
- **Slack Variables**: Introduced when data is not perfectly linearly separable, allowing some misclassification; the optimization minimizes `||w||^2 + C * Σ ξᵢ`, where the `ξᵢ` are the slack variables and `C` controls the trade-off between a wide margin and few training errors.
- **Kernel Methods**: Used when the decision surface is non-linear. They map the input data into a new, higher-dimensional feature space `φ(x)` where a linear separator might exist. The "kernel trick" computes the dot product in that higher-dimensional space implicitly (`K(x1, x2) = <φ(x1), φ(x2)>`), without ever calculating `φ(x)` explicitly.
- **Types of Kernels**: Common kernels include the dot product kernel (`K(x1, x2) = <x1, x2>`), the polynomial kernel (`K(x1, x2) = <x1, x2>^d`), the Gaussian kernel (`K(x1, x2) = exp(-||x1 - x2||^2 / (2σ^2))`, also known as a radial basis function), and the sigmoid kernel (`K(x1, x2) = tanh(κ<x1, x2> + θ)`). A function is a kernel if its Gram matrix (`G_ij = K(xi, xj)`) is positive semi-definite. The numeric check below illustrates the kernel trick for the polynomial case.
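For the polynomial kernel with `d = 2` in two dimensions, the explicit feature map is `φ(x) = (x1², √2·x1·x2, x2²)`, and both routes give the same dot product:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (no constant term)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])

print(np.dot(x1, x2) ** 2)       # kernel trick: <x1, x2>^2         -> 121.0
print(np.dot(phi(x1), phi(x2)))  # explicit map: <phi(x1), phi(x2)> -> 121.0
```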
- **Overtraining/Overfitting**: A common problem where a model learns the training data too well but fails to classify unseen examples correctly. It can be shown that the portion of unseen data that will be misclassified is bounded by the number of support vectors divided by the number of training examples. Ockham's razor principle suggests simpler systems (fewer support vectors) are better.
- **Features**: Characteristics or variables of an object used to represent it in SVM. Features must be numeric for mathematical algorithms, so data cleaning often includes quantizing them, e.g., mapping permissions to 0s and 1s, or attack impact to 0/1.
- **Class**: The group or category an object belongs to in machine learning (e.g., benign or malware). While often binary, multi-class ML algorithms exist.
- **Data Preparation for SVM**:
- **Clean the Data**: Quantize feature values, fill in missing values, and delete outliers.
- **Split the Data**: Divide the dataset into training and testing sets (typically 70% for training, 30% for testing) to evaluate the model's performance on unseen data.
- **Steps for Modeling with SVM**:
1. Select features.
2. Clean data (quantize features).
3. Identify the class variable.
4. Split data into training/testing sets.
5. Apply ML algorithms (e.g., SVM) to train the model.
6. Evaluate the model on the testing dataset.
- **SVM with Python**: Uses libraries like `pandas` and `sklearn` (specifically `sklearn.model_selection`, `sklearn.svm`, `sklearn.metrics`). You can define features (X) and the class (y), split the data using `train_test_split`, create an `SVC` classifier, train it with `fit()`, and make predictions with `predict()`. You can specify different kernels for the SVM classifier (e.g., `kernel='linear'`, `'poly'`, `'rbf'`, `'sigmoid'`).
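A minimal end-to-end sketch of that workflow on a made-up, already-quantized dataset (the `perm`/`impact` feature names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy dataset: two quantized features and a binary class (0 = benign, 1 = malware)
df = pd.DataFrame({"perm":   [0, 1, 0, 1, 1, 0, 1, 0],
                   "impact": [0, 1, 1, 1, 0, 0, 1, 0],
                   "label":  [0, 1, 0, 1, 1, 0, 1, 0]})

X, y = df[["perm", "impact"]], df["label"]   # features (X) and class (y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

clf = SVC(kernel='linear')        # also: 'poly', 'rbf', 'sigmoid'
clf.fit(X_train, y_train)         # train on the training set
y_pred = clf.predict(X_test)      # predict on unseen (test) data

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```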
- **Evaluation Metrics**:
- `accuracy_score`: Evaluates the overall accuracy of the classifier.
- **Confusion Matrix**: A table used to describe the performance of a classification model on a set of test data for which the true values are known. It shows:
- **True Positive (TP)**: Correctly classified positive instances.
- **False Negative (FN)**: Positive instances incorrectly classified as negative.
- **False Positive (FP)**: Negative instances incorrectly classified as positive.
- **True Negative (TN)**: Correctly classified negative instances.
- `N = Np + Nn` (total instances) where `Np = TP + FN` (total positive) and `Nn = FP + TN` (total negative).
- For a perfect classifier, FP = FN = 0.
- **Accuracy**: `(TP + TN) / (TP + FP + FN + TN)`.
- **Precision**: `TP / (TP + FP)` (How precise the detection is).
- **Recall**: `TP / (TP + FN)` (How good the model is at detecting actual positives, or coverage). High precision doesn't always mean high recall, especially if the model is biased.
- **F1 Score**: `2 * (Recall * Precision) / (Recall + Precision) = 2TP / (2TP + FP + FN)`. It's the harmonic mean of recall and precision, ensuring both are reasonably high for a good score.
- **k-Fold Cross-Validation**: A technique to evaluate model performance more robustly. For example, 10-fold cross-validation involves splitting the data into 10 "folds," training the model 10 times (each time using a different fold as the test set and the remaining 9 as the training set), and then averaging the performance scores. `sklearn.model_selection.cross_val_score` can be used for this.
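A sketch of 10-fold cross-validation using `cross_val_score`; the built-in iris dataset stands in for course data so the example is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 10-fold CV: train/evaluate 10 times, each fold serving once as the test set
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=10)
print(scores.mean(), scores.std())  # average performance across the 10 folds
```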
#### 5. Supervised and Unsupervised ML (~8 questions)
This section covers the fundamental types of machine learning paradigms.
- **Types of Learning**: Machine learning broadly categorized into:
- **Supervised Learning**: Training data includes desired outputs (labeled data). The system learns a mapping function (`Y = f(X)`) from input variables (X) to output variables (Y). It's similar to human learning under a teacher's supervision.
- **Uses**: Prediction of future cases, knowledge extraction, data compression, and outlier detection.
- **Categories**:
- **Classification**: The output variable is a category or group (e.g., "spam" or "no spam", "black" or "white"). Applications include face recognition, character recognition, medical diagnosis, and web advertising.
- **Regression**: The output variable is a real value (e.g., "price" or "height"). Example: predicting the price of a used car.
- **Common Algorithms**: Nearest Neighbour, Naive Bayes, Decision Trees, Linear Regression, Support Vector Machines (SVM), Neural Networks.
- **Unsupervised Learning**: Training data does _not_ include desired outputs; algorithms discover interesting structures in the data on their own. There are no given correct answers, and the machine finds the patterns itself.
- **Goal**: Learning "what normally happens".
- **Categories**:
- **Clustering**: Grouping similar instances based on inherent groupings in the data (e.g., grouping customers by purchasing behavior).
- **Association**: Discovering rules that describe large portions of data, such as "people that buy X also tend to buy Y" (e.g., market-basket analysis).
- **Other applications**: Summarization, Image compression (color quantization), Bioinformatics.
- **Semi-supervised Learning**: Uses a small amount of labeled data together with a large amount of unlabeled data during training. It's applied when acquiring a fully labeled dataset is too expensive or time-consuming.
- **Reinforcement Learning**: No supervised output, but delayed rewards. A program interacts with a dynamic environment, receiving feedback (rewards/punishments) to learn how to achieve a goal through trial and error. Examples include game playing, robots in mazes, self-driving cars, and AlphaGo.
- **Why Machine Learning with Python**: Python is popular for research and production systems due to its vast ecosystem of modules, packages, and libraries. Libraries like NumPy, Pandas, SciPy, Scikit-Learn, and Matplotlib are extensively used in data science, data analysis, and creating scalable ML algorithms. Python implements common ML techniques such as Classification, Regression, Recommendation, and Clustering.
Good luck with your midterm!