init
commit 67e1639548
15061  adult-clean/test_clean.csv  Normal file
    File diff suppressed because it is too large
30163  adult-clean/train_clean.csv  Normal file
    File diff suppressed because it is too large
6  adult/Index  Normal file
@@ -0,0 +1,6 @@
Index of adult

02 Dec 1996      140 Index
10 Aug 1996  3974305 adult.data
10 Aug 1996     4267 adult.names
10 Aug 1996  2003153 adult.test
32562  adult/adult.data  Normal file
    File diff suppressed because it is too large
110  adult/adult.names  Normal file
@@ -0,0 +1,110 @@
| This data was extracted from the census bureau database found at
|   http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances: 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database. A set of
| reasonably clean records was extracted using the following conditions:
|   ((AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0))
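For readers reproducing the pipeline, the extraction filter above maps directly
onto a pandas mask. A minimal sketch, assuming hypothetical raw-extract columns
named AAGE, AGI, AFNLWGT, and HRSWK (these fields belong to the raw census
extract, not to adult.data itself, and the input file name is an assumption):

    import pandas as pd

    # Hypothetical raw census extract; file name and columns are assumptions.
    raw = pd.read_csv("census_raw.csv")
    clean = raw[(raw["AAGE"] > 16) & (raw["AGI"] > 100)
                & (raw["AFNLWGT"] > 1) & (raw["HRSWK"] > 0)]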
|
| Prediction task is to determine whether a person makes over 50K
| a year.
|
| First cited in:
| @inproceedings{kohavi-nbtree,
|    author={Ron Kohavi},
|    title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a
|           Decision-Tree Hybrid},
|    booktitle={Proceedings of the Second International Conference on
|               Knowledge Discovery and Data Mining},
|    year = 1996,
|    pages={to appear}}
|
| Error accuracy reported as follows (after removal of unknowns from
| the train/test sets):
|    C4.5       : 84.46 +- 0.30
|    Naive-Bayes: 83.88 +- 0.30
|    NBTree     : 85.90 +- 0.28
|
| The following algorithms were later run with the error rates below,
| all after removal of unknowns and using the original train/test split.
| All these numbers are straight runs using MLC++ with default values.
|
|    Algorithm               Error
| -- ----------------------- -----
| 1  C4.5                    15.54
| 2  C4.5-auto               14.46
| 3  C4.5 rules              14.94
| 4  Voted ID3 (0.6)         15.64
| 5  Voted ID3 (0.8)         16.47
| 6  T2                      16.84
| 7  1R                      19.54
| 8  NBTree                  14.10
| 9  CN2                     16.00
| 10 HOODG                   14.82
| 11 FSS Naive Bayes         14.05
| 12 IDTM (Decision table)   14.46
| 13 Naive-Bayes             16.12
| 14 Nearest-neighbor (1)    21.42
| 15 Nearest-neighbor (3)    20.35
| 16 OC1                     15.04
| 17 Pebls                   Crashed. Unknown why (bounds WERE increased)
|
| Conversion of the original data was as follows:
| 1. Discretized gross income into two ranges with threshold 50,000.
| 2. Converted U.S. to US to avoid periods.
| 3. Converted Unknown to "?".
| 4. Ran MLC++ GenCVFiles to generate the data/test split.
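Steps 1-3 are easy to mirror in pandas. A small sketch on a toy frame (the
column names and values here are illustrative assumptions, not taken from the
dataset files):

    import pandas as pd

    raw = pd.DataFrame({"gross_income": [60000, 30000],
                        "country": ["U.S.", "Unknown"]})

    # Step 1: discretize gross income at the 50,000 threshold.
    raw["income"] = (raw["gross_income"] > 50_000).map({True: ">50K",
                                                        False: "<=50K"})
    # Step 2: 'U.S.' -> 'US' to avoid periods.
    raw["country"] = raw["country"].str.replace("U.S.", "US", regex=False)
    # Step 3: 'Unknown' -> '?'.
    raw["country"] = raw["country"].replace("Unknown", "?")
    # Step 4 (the train/test split) was done with MLC++ GenCVFiles.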
|
| Description of fnlwgt (final weight)
|
| The weights on the CPS files are controlled to independent estimates of the
| civilian noninstitutional population of the US. These are prepared monthly
| for us by Population Division here at the Census Bureau. We use 3 sets of
| controls. These are:
|   1. A single cell estimate of the population 16+ for each state.
|   2. Controls for Hispanic Origin by age and sex.
|   3. Controls by Race, age and sex.
|
| We use all three sets of controls in our weighting program and "rake" through
| them 6 times so that by the end we come back to all the controls we used.
|
| The term estimate refers to population totals derived from CPS by creating
| "weighted tallies" of any specified socio-economic characteristics of the
| population.
|
| People with similar demographic characteristics should have similar weights.
| There is one important caveat to remember about this statement: since the
| CPS sample is actually a collection of 51 state samples, each with its own
| probability of selection, the statement only applies within state.
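The "raking" procedure described above is iterative proportional fitting:
scale the weight table to match each set of control totals in turn, cycling
until the margins agree. A toy two-control sketch in numpy (the table, the
margins, and the use of exactly six passes are illustrative assumptions):

    import numpy as np

    # Toy table of sample counts, e.g. age band x sex.
    weights = np.array([[20., 30.],
                        [35., 15.]])
    row_controls = np.array([60., 40.])   # known totals per age band
    col_controls = np.array([55., 45.])   # known totals per sex

    for _ in range(6):  # the file mentions raking through the controls 6 times
        # Scale each row so row sums hit the row controls...
        weights *= (row_controls / weights.sum(axis=1))[:, None]
        # ...then each column so column sums hit the column controls.
        weights *= col_controls / weights.sum(axis=0)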
|
>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
16283  adult/adult.test  Normal file
    File diff suppressed because it is too large
89  adult/old.adult.names  Normal file
@@ -0,0 +1,89 @@
1. Title of Database: adult

2. Sources:
   (a) Original owners of database (name/phone/snail address/email address)
       US Census Bureau.
   (b) Donor of database (name/phone/snail address/email address)
       Ronny Kohavi and Barry Becker,
       Data Mining and Visualization
       Silicon Graphics.
       e-mail: ronnyk@sgi.com
   (c) Date received (databases may change over time without name change!)
       05/19/96

3. Past Usage:
   (a) Complete reference of article where it was described/used
       @inproceedings{kohavi-nbtree,
          author={Ron Kohavi},
          title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a
                 Decision-Tree Hybrid},
          booktitle={Proceedings of the Second International Conference on
                     Knowledge Discovery and Data Mining},
          year = 1996,
          pages={to appear}}
   (b) Indication of what attribute(s) were being predicted
       Salary greater or less than 50,000.
   (c) Indication of study's results (i.e. Is it a good domain to use?)
       Hard domain with a nice number of records.
       The following results were obtained using MLC++ with default settings
       for the algorithms mentioned below.

          Algorithm               Error
       -- ----------------------- -----
       1  C4.5                    15.54
       2  C4.5-auto               14.46
       3  C4.5 rules              14.94
       4  Voted ID3 (0.6)         15.64
       5  Voted ID3 (0.8)         16.47
       6  T2                      16.84
       7  1R                      19.54
       8  NBTree                  14.10
       9  CN2                     16.00
       10 HOODG                   14.82
       11 FSS Naive Bayes         14.05
       12 IDTM (Decision table)   14.46
       13 Naive-Bayes             16.12
       14 Nearest-neighbor (1)    21.42
       15 Nearest-neighbor (3)    20.35
       16 OC1                     15.04
       17 Pebls                   Crashed. Unknown why (bounds WERE increased)

4. Relevant Information Paragraph:
   Extraction was done by Barry Becker from the 1994 Census database. A set
   of reasonably clean records was extracted using the following conditions:
   ((AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0))

5. Number of Instances
   48842 instances, mix of continuous and discrete    (train=32561, test=16281)
   45222 if instances with unknown values are removed (train=30162, test=15060)
   Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).

6. Number of Attributes
   6 continuous, 8 nominal attributes.

7. Attribute Information:

   age: continuous.
   workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
   fnlwgt: continuous.
   education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
   education-num: continuous.
   marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
   occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
   relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
   race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
   sex: Female, Male.
   capital-gain: continuous.
   capital-loss: continuous.
   hours-per-week: continuous.
   native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
   class: >50K, <=50K

8. Missing Attribute Values:

   7% have missing values.

9. Class Distribution:

   Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
   Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
155  part1.py  Normal file
@@ -0,0 +1,155 @@
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from typing import Tuple
import os
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


def process_data(train_path: str, test_path: str) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Processes the adult dataset by cleaning, removing continuous attributes, and one-hot encoding.

    Args:
        train_path (str): The path to the training data file.
        test_path (str): The path to the test data file.

    Returns:
        Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]: A tuple containing:
            - X_train_encoded: Processed and one-hot encoded training features.
            - y_train: Training labels.
            - X_test_encoded: Processed and one-hot encoded test features.
            - y_test: Test labels.
    """
    columns: list[str] = [
        'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income'
    ]

    # Load datasets; skiprows=1 skips the '|1x3 Cross validator' marker line
    # at the top of adult.test
    df_train: pd.DataFrame = pd.read_csv(train_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?')
    df_test: pd.DataFrame = pd.read_csv(test_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?', skiprows=1)

    # Remove rows with any missing values
    df_train.dropna(inplace=True)
    df_test.dropna(inplace=True)

    # Define continuous attributes to remove
    continuous_attributes: list[str] = [
        'age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week'
    ]

    # Separate features and target; strip the trailing '.' that adult.test
    # labels carry ('>50K.' vs '>50K')
    X_train: pd.DataFrame = df_train.drop(columns=['income'])
    y_train: pd.Series = df_train['income'].str.replace('.', '', regex=False)
    X_test: pd.DataFrame = df_test.drop(columns=['income'])
    y_test: pd.Series = df_test['income'].str.replace('.', '', regex=False)

    # Remove continuous attributes
    X_train = X_train.drop(columns=continuous_attributes)
    X_test = X_test.drop(columns=continuous_attributes)

    # Identify categorical attributes for one-hot encoding
    categorical_attributes: list[str] = X_train.columns.tolist()

    # One-hot encode categorical attributes
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    X_train_encoded: pd.DataFrame = pd.DataFrame(encoder.fit_transform(X_train[categorical_attributes]), columns=encoder.get_feature_names_out(categorical_attributes))
    X_test_encoded: pd.DataFrame = pd.DataFrame(encoder.transform(X_test[categorical_attributes]), columns=encoder.get_feature_names_out(categorical_attributes))

    return X_train_encoded, y_train, X_test_encoded, y_test


def evaluate_model(X_train: pd.DataFrame, y_train: pd.Series, X_test: pd.DataFrame, y_test: pd.Series, model, model_name: str):
    """
    Trains and evaluates a given model, printing a detailed report.

    Args:
        X_train (pd.DataFrame): Training features.
        y_train (pd.Series): Training labels.
        X_test (pd.DataFrame): Test features.
        y_test (pd.Series): Test labels.
        model: The classifier model to evaluate.
        model_name (str): The name of the model for reporting.
    """
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Generate the classification report
    report = classification_report(y_test, y_pred, output_dict=True)

    # Calculate confusion matrix to get TP and FP rates
    # For binary classification: [[TN, FP], [FN, TP]]
    # For multi-class, we calculate per class
    cm = confusion_matrix(y_test, y_pred, labels=model.classes_)

    print(f"--- {model_name} Evaluation ---")
    print(f"Overall Accuracy: {accuracy_score(y_test, y_pred):.4f}\n")

    for label in model.classes_:
        # Get the index for the current class
        class_idx = list(model.classes_).index(label)

        # TP is the diagonal element
        tp = cm[class_idx, class_idx]

        # FP is the sum of the column for this class, excluding the TP
        fp = cm[:, class_idx].sum() - tp

        # FN is the sum of the row for this class, excluding the TP
        fn = cm[class_idx, :].sum() - tp

        # TN is the sum of all cells minus the TP, FP, and FN for this class
        tn = cm.sum() - (tp + fp + fn)

        # Rates
        tp_rate = tp / (tp + fn) if (tp + fn) > 0 else 0  # Same as recall
        fp_rate = fp / (fp + tn) if (fp + tn) > 0 else 0

        print(f"Class: {label}")
        print(f"  TP Rate (Recall): {tp_rate:.4f}")
        print(f"  FP Rate         : {fp_rate:.4f}")
        print(f"  Precision       : {report[label]['precision']:.4f}")
        print(f"  F1-Score        : {report[label]['f1-score']:.4f}")
        print("-" * 20)


if __name__ == '__main__':
    train_file = 'adult/adult.data'
    test_file = 'adult/adult.test'
    output_dir = 'adult-clean'

    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    X_train, y_train, X_test, y_test = process_data(train_file, test_file)

    # Reset index to align features and labels for concatenation
    y_train = y_train.reset_index(drop=True)
    X_train = X_train.reset_index(drop=True)
    y_test = y_test.reset_index(drop=True)
    X_test = X_test.reset_index(drop=True)

    # Concatenate features and labels
    train_cleaned = pd.concat([X_train, y_train], axis=1)
    test_cleaned = pd.concat([X_test, y_test], axis=1)

    # Save the cleaned data to new CSV files
    train_cleaned.to_csv(os.path.join(output_dir, 'train_clean.csv'), index=False)
    test_cleaned.to_csv(os.path.join(output_dir, 'test_clean.csv'), index=False)

    print(f"Preprocessed data saved to '{output_dir}' directory.")
    print(f"Training data shape: {train_cleaned.shape}")
    print(f"Test data shape: {test_cleaned.shape}\n")

    # --- Model Training and Evaluation ---

    # 1. Decision Tree Classifier
    dt_classifier = DecisionTreeClassifier(random_state=42)
    evaluate_model(X_train, y_train, X_test, y_test, dt_classifier, "Decision Tree Classifier")

    # 2. Naïve Bayesian Classifier
    nb_classifier = BernoulliNB()
    evaluate_model(X_train, y_train, X_test, y_test, nb_classifier, "Naïve Bayesian Classifier")
132  part2.py  Normal file
@@ -0,0 +1,132 @@
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from typing import Tuple, List
import numpy as np


def process_data_part2(train_path: str, test_path: str) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Processes the adult dataset for part 2 requirements.
    - Removes unknown values.
    - Binarizes numerical attributes based on the mean.
    - One-hot encodes categorical attributes.
    """
    columns: List[str] = [
        'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income'
    ]

    # Load datasets
    df_train: pd.DataFrame = pd.read_csv(train_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?')
    df_test: pd.DataFrame = pd.read_csv(test_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?', skiprows=1)

    # Remove rows with any missing values
    df_train.dropna(inplace=True)
    df_test.dropna(inplace=True)

    # Separate features and target, and clean target labels
    X_train_raw = df_train.drop('income', axis=1)
    y_train = df_train['income'].str.replace('.', '', regex=False)
    X_test_raw = df_test.drop('income', axis=1)
    y_test = df_test['income'].str.replace('.', '', regex=False)

    # Identify numerical and categorical attributes
    numerical_cols = X_train_raw.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = X_train_raw.select_dtypes(exclude=np.number).columns.tolist()

    # --- Preprocessing ---

    # 1. Binarize numerical attributes
    X_train_numerical_processed = pd.DataFrame()
    X_test_numerical_processed = pd.DataFrame()

    for col in numerical_cols:
        mean_val = X_train_raw[col].mean()
        X_train_numerical_processed[col] = (X_train_raw[col] > mean_val).astype(int)
        X_test_numerical_processed[col] = (X_test_raw[col] > mean_val).astype(int)

    # 2. One-hot encode categorical attributes
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    X_train_categorical_processed = pd.DataFrame(
        encoder.fit_transform(X_train_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols)
    )
    X_test_categorical_processed = pd.DataFrame(
        encoder.transform(X_test_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols)
    )

    # Reset index to ensure concatenation works correctly
    X_train_numerical_processed.index = X_train_categorical_processed.index
    X_test_numerical_processed.index = X_test_categorical_processed.index
    y_train.index = X_train_categorical_processed.index
    y_test.index = X_test_categorical_processed.index

    # 3. Combine processed features
    X_train_processed = pd.concat([X_train_numerical_processed, X_train_categorical_processed], axis=1)
    X_test_processed = pd.concat([X_test_numerical_processed, X_test_categorical_processed], axis=1)

    return X_train_processed, y_train, X_test_processed, y_test


def run_kmeans_clustering(X_train: pd.DataFrame, k_values: List[int]):
    """
    Runs K-Means clustering for different k values and reports centroids.
    """
    print("--- K-Means Clustering ---")
    for k in k_values:
        print(f"\nRunning K-Means with k={k}...")
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X_train)

        print(f"Centroids for k={k}:")
        # Printing only the first 5 dimensions for brevity
        print(pd.DataFrame(kmeans.cluster_centers_[:, :5], columns=X_train.columns[:5]))
        print("-" * 20)


def run_knn_classification(X_train: pd.DataFrame, y_train: pd.Series, X_test: pd.DataFrame, y_test: pd.Series, k_values: List[int]):
    """
    Runs kNN classification on the last 10 test samples and reports accuracy.
    """
    print("\n--- k-Nearest Neighbors (kNN) Classification ---")

    # Use the last 10 records from the test set
    X_test_sample = X_test.tail(10)
    y_test_sample = y_test.tail(10)

    print(f"Predicting for the last {len(X_test_sample)} records of the test set.\n")

    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)

        y_pred_sample = knn.predict(X_test_sample)
        accuracy = accuracy_score(y_test_sample, y_pred_sample)

        print(f"kNN with k={k}:")
        print(f"  Prediction Accuracy: {accuracy:.2f}")
        print(f"  Predicted Labels: {y_pred_sample}")
        print(f"  Actual Labels:    {y_test_sample.values}")
        print("-" * 20)


if __name__ == '__main__':
    train_file = 'adult/adult.data'
    test_file = 'adult/adult.test'

    # Process data according to Part 2 requirements
    X_train, y_train, X_test, y_test = process_data_part2(train_file, test_file)

    print("Data processing complete.")
    print(f"Training data shape: {X_train.shape}")
    print(f"Test data shape: {X_test.shape}\n")

    # Run K-Means Clustering
    kmeans_k_values = [3, 5, 10]
    run_kmeans_clustering(X_train, kmeans_k_values)

    # Run kNN Classification
    knn_k_values = [3, 5, 10]
    run_knn_classification(X_train, y_train, X_test, y_test, knn_k_values)
107  part3.py  Normal file
@@ -0,0 +1,107 @@
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from typing import Tuple, List
import numpy as np


# This is the same data processing function from Part 2
def process_data_part2(train_path: str, test_path: str) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Processes the adult dataset for part 2 requirements.
    - Removes unknown values.
    - Binarizes numerical attributes based on the mean.
    - One-hot encodes categorical attributes.
    """
    columns: List[str] = [
        'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income'
    ]

    # Load datasets
    df_train: pd.DataFrame = pd.read_csv(train_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?')
    df_test: pd.DataFrame = pd.read_csv(test_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?', skiprows=1)

    # Remove rows with any missing values
    df_train.dropna(inplace=True)
    df_test.dropna(inplace=True)

    # Separate features and target, and clean target labels
    X_train_raw = df_train.drop('income', axis=1)
    y_train = df_train['income'].str.replace('.', '', regex=False)
    X_test_raw = df_test.drop('income', axis=1)
    y_test = df_test['income'].str.replace('.', '', regex=False)

    # Identify numerical and categorical attributes
    numerical_cols = X_train_raw.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = X_train_raw.select_dtypes(exclude=np.number).columns.tolist()

    # --- Preprocessing ---

    # 1. Binarize numerical attributes
    X_train_numerical_processed = pd.DataFrame()
    X_test_numerical_processed = pd.DataFrame()

    for col in numerical_cols:
        mean_val = X_train_raw[col].mean()
        X_train_numerical_processed[col] = (X_train_raw[col] > mean_val).astype(int)
        X_test_numerical_processed[col] = (X_test_raw[col] > mean_val).astype(int)

    # 2. One-hot encode categorical attributes
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    X_train_categorical_processed = pd.DataFrame(
        encoder.fit_transform(X_train_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols),
        index=X_train_raw.index
    )
    X_test_categorical_processed = pd.DataFrame(
        encoder.transform(X_test_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols),
        index=X_test_raw.index
    )

    # 3. Combine processed features
    X_train_processed = pd.concat([X_train_numerical_processed, X_train_categorical_processed], axis=1)
    X_test_processed = pd.concat([X_test_numerical_processed, X_test_categorical_processed], axis=1)

    # Align y labels with the processed X dataframes
    y_train = y_train.loc[X_train_processed.index]
    y_test = y_test.loc[X_test_processed.index]

    return X_train_processed, y_train, X_test_processed, y_test


if __name__ == '__main__':
    train_file = 'adult/adult.data'
    test_file = 'adult/adult.test'

    # Process data using the function from Part 2
    X_train, y_train, X_test, y_test = process_data_part2(train_file, test_file)

    print("Data processing complete.")
    print(f"Training data shape: {X_train.shape}")
    print(f"Test data shape: {X_test.shape}\n")

    # --- SVM Classifier ---
    print("--- Support Vector Machine (SVM) Classifier ---")

    # Initialize the SVM classifier. A linear kernel is often a good starting point.
    # Note: SVM training can be slow on large datasets; for a quicker demonstration
    # you could train on a sample, e.g. X_train.sample(n=5000, random_state=42).
    # Here the full training set is used.
    print("Training the SVM classifier... (This may take a few minutes)")
    svm_classifier = SVC(kernel='linear', random_state=42)

    # Train the model on the full training data
    svm_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    print("Making predictions on the test data...")
    y_pred = svm_classifier.predict(X_test)

    # Calculate and report the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    print(f"\nSVM Classifier Accuracy on Test Data: {accuracy:.4f}")
105  part4.py  Normal file
@@ -0,0 +1,105 @@
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from typing import Tuple, List
import numpy as np


# This is the same data processing function from Part 2
def process_data_part2(train_path: str, test_path: str) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Processes the adult dataset for part 2 requirements.
    - Removes unknown values.
    - Binarizes numerical attributes based on the mean.
    - One-hot encodes categorical attributes.
    """
    columns: List[str] = [
        'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income'
    ]

    # Load datasets
    df_train: pd.DataFrame = pd.read_csv(train_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?')
    df_test: pd.DataFrame = pd.read_csv(test_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?', skiprows=1)

    # Remove rows with any missing values
    df_train.dropna(inplace=True)
    df_test.dropna(inplace=True)

    # Separate features and target, and clean target labels
    X_train_raw = df_train.drop('income', axis=1)
    y_train = df_train['income'].str.replace('.', '', regex=False)
    X_test_raw = df_test.drop('income', axis=1)
    y_test = df_test['income'].str.replace('.', '', regex=False)

    # Identify numerical and categorical attributes
    numerical_cols = X_train_raw.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = X_train_raw.select_dtypes(exclude=np.number).columns.tolist()

    # --- Preprocessing ---

    # 1. Binarize numerical attributes
    X_train_numerical_processed = pd.DataFrame()
    X_test_numerical_processed = pd.DataFrame()

    for col in numerical_cols:
        mean_val = X_train_raw[col].mean()
        X_train_numerical_processed[col] = (X_train_raw[col] > mean_val).astype(int)
        X_test_numerical_processed[col] = (X_test_raw[col] > mean_val).astype(int)

    # 2. One-hot encode categorical attributes
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    X_train_categorical_processed = pd.DataFrame(
        encoder.fit_transform(X_train_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols),
        index=X_train_raw.index
    )
    X_test_categorical_processed = pd.DataFrame(
        encoder.transform(X_test_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols),
        index=X_test_raw.index
    )

    # 3. Combine processed features
    X_train_processed = pd.concat([X_train_numerical_processed, X_train_categorical_processed], axis=1)
    X_test_processed = pd.concat([X_test_numerical_processed, X_test_categorical_processed], axis=1)

    # Align y labels with the processed X dataframes
    y_train = y_train.loc[X_train_processed.index]
    y_test = y_test.loc[X_test_processed.index]

    return X_train_processed, y_train, X_test_processed, y_test


if __name__ == '__main__':
    train_file = 'adult/adult.data'
    test_file = 'adult/adult.test'

    # Process data using the function from Part 2
    X_train, y_train, X_test, y_test = process_data_part2(train_file, test_file)

    print("Data processing complete.")
    print(f"Training data shape: {X_train.shape}")
    print(f"Test data shape: {X_test.shape}\n")

    # --- Neural Network Classifier ---
    print("--- Neural Network (MLP) Classifier ---")

    # Initialize the Multi-layer Perceptron classifier.
    # hidden_layer_sizes=(100,) means one hidden layer with 100 neurons.
    # max_iter=500 to give the model enough iterations to converge.
    # random_state=42 for reproducibility.
    nn_classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)

    print("Training the Neural Network classifier...")
    nn_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    print("Making predictions on the test data...")
    y_pred = nn_classifier.predict(X_test)

    # Calculate and report the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    print(f"\nNeural Network Classifier Accuracy on Test Data: {accuracy:.4f}")
2  requirements.txt  Normal file
@@ -0,0 +1,2 @@
pandas
scikit-learn>=1.2  # OneHotEncoder's sparse_output argument requires 1.2+