init
commit 67e1639548
15061  adult-clean/test_clean.csv  Normal file
    File diff suppressed because it is too large
30163  adult-clean/train_clean.csv  Normal file
    File diff suppressed because it is too large
6  adult/Index  Normal file
@@ -0,0 +1,6 @@
Index of adult

02 Dec 1996      140 Index
10 Aug 1996  3974305 adult.data
10 Aug 1996     4267 adult.names
10 Aug 1996  2003153 adult.test
32562  adult/adult.data  Normal file
    File diff suppressed because it is too large
110  adult/adult.names  Normal file
@@ -0,0 +1,110 @@
| This data was extracted from the census bureau database found at
|   http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances: 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database. A set of
| reasonably clean records was extracted using the following conditions:
|   ((AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0))
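For readers reproducing the pipeline, the extraction filter above maps directly
onto a pandas mask. A minimal sketch, assuming hypothetical raw-extract columns
named AAGE, AGI, AFNLWGT, and HRSWK (these fields belong to the raw census
extract, not to adult.data itself, and the input file name is an assumption):

    import pandas as pd

    # Hypothetical raw census extract; file name and columns are assumptions.
    raw = pd.read_csv("census_raw.csv")
    clean = raw[(raw["AAGE"] > 16) & (raw["AGI"] > 100)
                & (raw["AFNLWGT"] > 1) & (raw["HRSWK"] > 0)]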
|
| Prediction task is to determine whether a person makes over 50K
| a year.
|
| First cited in:
| @inproceedings{kohavi-nbtree,
|    author={Ron Kohavi},
|    title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a
|           Decision-Tree Hybrid},
|    booktitle={Proceedings of the Second International Conference on
|               Knowledge Discovery and Data Mining},
|    year = 1996,
|    pages={to appear}}
|
| Error accuracy reported as follows (after removal of unknowns from
| the train/test sets):
|    C4.5       : 84.46 +- 0.30
|    Naive-Bayes: 83.88 +- 0.30
|    NBTree     : 85.90 +- 0.28
|
| The following algorithms were later run with the error rates below,
| all after removal of unknowns and using the original train/test split.
| All these numbers are straight runs using MLC++ with default values.
|
|    Algorithm               Error
| -- ----------------------- -----
| 1  C4.5                    15.54
| 2  C4.5-auto               14.46
| 3  C4.5 rules              14.94
| 4  Voted ID3 (0.6)         15.64
| 5  Voted ID3 (0.8)         16.47
| 6  T2                      16.84
| 7  1R                      19.54
| 8  NBTree                  14.10
| 9  CN2                     16.00
| 10 HOODG                   14.82
| 11 FSS Naive Bayes         14.05
| 12 IDTM (Decision table)   14.46
| 13 Naive-Bayes             16.12
| 14 Nearest-neighbor (1)    21.42
| 15 Nearest-neighbor (3)    20.35
| 16 OC1                     15.04
| 17 Pebls                   Crashed. Unknown why (bounds WERE increased)
|
| Conversion of the original data was as follows:
| 1. Discretized gross income into two ranges with threshold 50,000.
| 2. Converted U.S. to US to avoid periods.
| 3. Converted Unknown to "?".
| 4. Ran MLC++ GenCVFiles to generate the data/test split.
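Steps 1-3 are easy to mirror in pandas. A small sketch on a toy frame (the
column names and values here are illustrative assumptions, not taken from the
dataset files):

    import pandas as pd

    raw = pd.DataFrame({"gross_income": [60000, 30000],
                        "country": ["U.S.", "Unknown"]})

    # Step 1: discretize gross income at the 50,000 threshold.
    raw["income"] = (raw["gross_income"] > 50_000).map({True: ">50K",
                                                        False: "<=50K"})
    # Step 2: 'U.S.' -> 'US' to avoid periods.
    raw["country"] = raw["country"].str.replace("U.S.", "US", regex=False)
    # Step 3: 'Unknown' -> '?'.
    raw["country"] = raw["country"].replace("Unknown", "?")
    # Step 4 (the train/test split) was done with MLC++ GenCVFiles.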
|
| Description of fnlwgt (final weight)
|
| The weights on the CPS files are controlled to independent estimates of the
| civilian noninstitutional population of the US. These are prepared monthly
| for us by Population Division here at the Census Bureau. We use 3 sets of
| controls. These are:
|   1. A single cell estimate of the population 16+ for each state.
|   2. Controls for Hispanic Origin by age and sex.
|   3. Controls by Race, age and sex.
|
| We use all three sets of controls in our weighting program and "rake" through
| them 6 times so that by the end we come back to all the controls we used.
|
| The term estimate refers to population totals derived from CPS by creating
| "weighted tallies" of any specified socio-economic characteristics of the
| population.
|
| People with similar demographic characteristics should have similar weights.
| There is one important caveat to remember about this statement: since the
| CPS sample is actually a collection of 51 state samples, each with its own
| probability of selection, the statement only applies within state.
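The "raking" procedure described above is iterative proportional fitting:
scale the weight table to match each set of control totals in turn, cycling
until the margins agree. A toy two-control sketch in numpy (the table, the
margins, and the use of exactly six passes are illustrative assumptions):

    import numpy as np

    # Toy table of sample counts, e.g. age band x sex.
    weights = np.array([[20., 30.],
                        [35., 15.]])
    row_controls = np.array([60., 40.])   # known totals per age band
    col_controls = np.array([55., 45.])   # known totals per sex

    for _ in range(6):  # the file mentions raking through the controls 6 times
        # Scale each row so row sums hit the row controls...
        weights *= (row_controls / weights.sum(axis=1))[:, None]
        # ...then each column so column sums hit the column controls.
        weights *= col_controls / weights.sum(axis=0)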
|
>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
16283  adult/adult.test  Normal file
    File diff suppressed because it is too large
89  adult/old.adult.names  Normal file
@@ -0,0 +1,89 @@
1. Title of Database: adult

2. Sources:
   (a) Original owners of database (name/phone/snail address/email address)
       US Census Bureau.
   (b) Donor of database (name/phone/snail address/email address)
       Ronny Kohavi and Barry Becker,
       Data Mining and Visualization
       Silicon Graphics.
       e-mail: ronnyk@sgi.com
   (c) Date received (databases may change over time without name change!)
       05/19/96

3. Past Usage:
   (a) Complete reference of article where it was described/used
       @inproceedings{kohavi-nbtree,
          author={Ron Kohavi},
          title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a
                 Decision-Tree Hybrid},
          booktitle={Proceedings of the Second International Conference on
                     Knowledge Discovery and Data Mining},
          year = 1996,
          pages={to appear}}
   (b) Indication of what attribute(s) were being predicted
       Salary greater or less than 50,000.
   (c) Indication of study's results (i.e. Is it a good domain to use?)
       Hard domain with a nice number of records.
       The following results were obtained using MLC++ with default settings
       for the algorithms mentioned below.

          Algorithm               Error
       -- ----------------------- -----
       1  C4.5                    15.54
       2  C4.5-auto               14.46
       3  C4.5 rules              14.94
       4  Voted ID3 (0.6)         15.64
       5  Voted ID3 (0.8)         16.47
       6  T2                      16.84
       7  1R                      19.54
       8  NBTree                  14.10
       9  CN2                     16.00
       10 HOODG                   14.82
       11 FSS Naive Bayes         14.05
       12 IDTM (Decision table)   14.46
       13 Naive-Bayes             16.12
       14 Nearest-neighbor (1)    21.42
       15 Nearest-neighbor (3)    20.35
       16 OC1                     15.04
       17 Pebls                   Crashed. Unknown why (bounds WERE increased)

4. Relevant Information Paragraph:
   Extraction was done by Barry Becker from the 1994 Census database. A set
   of reasonably clean records was extracted using the following conditions:
   ((AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0))

5. Number of Instances
   48842 instances, mix of continuous and discrete    (train=32561, test=16281)
   45222 if instances with unknown values are removed (train=30162, test=15060)
   Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).

6. Number of Attributes
   6 continuous, 8 nominal attributes.

7. Attribute Information:

   age: continuous.
   workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
   fnlwgt: continuous.
   education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
   education-num: continuous.
   marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
   occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
   relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
   race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
   sex: Female, Male.
   capital-gain: continuous.
   capital-loss: continuous.
   hours-per-week: continuous.
   native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
   class: >50K, <=50K

8. Missing Attribute Values:

   7% have missing values.

9. Class Distribution:

   Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
   Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
155  part1.py  Normal file
@@ -0,0 +1,155 @@
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from typing import Tuple
import os
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


def process_data(train_path: str, test_path: str) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Processes the adult dataset by cleaning, removing continuous attributes, and one-hot encoding.

    Args:
        train_path (str): The path to the training data file.
        test_path (str): The path to the test data file.

    Returns:
        Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]: A tuple containing:
            - X_train_encoded: Processed and one-hot encoded training features.
            - y_train: Training labels.
            - X_test_encoded: Processed and one-hot encoded test features.
            - y_test: Test labels.
    """
    columns: list[str] = [
        'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income'
    ]

    # Load datasets; skiprows=1 skips the '|1x3 Cross validator' marker line
    # at the top of adult.test
    df_train: pd.DataFrame = pd.read_csv(train_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?')
    df_test: pd.DataFrame = pd.read_csv(test_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?', skiprows=1)

    # Remove rows with any missing values
    df_train.dropna(inplace=True)
    df_test.dropna(inplace=True)

    # Define continuous attributes to remove
    continuous_attributes: list[str] = [
        'age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week'
    ]

    # Separate features and target; strip the trailing '.' that adult.test
    # labels carry ('>50K.' vs '>50K')
    X_train: pd.DataFrame = df_train.drop(columns=['income'])
    y_train: pd.Series = df_train['income'].str.replace('.', '', regex=False)
    X_test: pd.DataFrame = df_test.drop(columns=['income'])
    y_test: pd.Series = df_test['income'].str.replace('.', '', regex=False)

    # Remove continuous attributes
    X_train = X_train.drop(columns=continuous_attributes)
    X_test = X_test.drop(columns=continuous_attributes)

    # Identify categorical attributes for one-hot encoding
    categorical_attributes: list[str] = X_train.columns.tolist()

    # One-hot encode categorical attributes
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    X_train_encoded: pd.DataFrame = pd.DataFrame(encoder.fit_transform(X_train[categorical_attributes]), columns=encoder.get_feature_names_out(categorical_attributes))
    X_test_encoded: pd.DataFrame = pd.DataFrame(encoder.transform(X_test[categorical_attributes]), columns=encoder.get_feature_names_out(categorical_attributes))

    return X_train_encoded, y_train, X_test_encoded, y_test


def evaluate_model(X_train: pd.DataFrame, y_train: pd.Series, X_test: pd.DataFrame, y_test: pd.Series, model, model_name: str):
    """
    Trains and evaluates a given model, printing a detailed report.

    Args:
        X_train (pd.DataFrame): Training features.
        y_train (pd.Series): Training labels.
        X_test (pd.DataFrame): Test features.
        y_test (pd.Series): Test labels.
        model: The classifier model to evaluate.
        model_name (str): The name of the model for reporting.
    """
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Generate the classification report
    report = classification_report(y_test, y_pred, output_dict=True)

    # Calculate confusion matrix to get TP and FP rates
    # For binary classification: [[TN, FP], [FN, TP]]
    # For multi-class, we calculate per class
    cm = confusion_matrix(y_test, y_pred, labels=model.classes_)

    print(f"--- {model_name} Evaluation ---")
    print(f"Overall Accuracy: {accuracy_score(y_test, y_pred):.4f}\n")

    for label in model.classes_:
        # Get the index for the current class
        class_idx = list(model.classes_).index(label)

        # TP is the diagonal element
        tp = cm[class_idx, class_idx]

        # FP is the sum of the column for this class, excluding the TP
        fp = cm[:, class_idx].sum() - tp

        # FN is the sum of the row for this class, excluding the TP
        fn = cm[class_idx, :].sum() - tp

        # TN is the sum of all cells minus the TP, FP, and FN for this class
        tn = cm.sum() - (tp + fp + fn)

        # Rates
        tp_rate = tp / (tp + fn) if (tp + fn) > 0 else 0  # Same as recall
        fp_rate = fp / (fp + tn) if (fp + tn) > 0 else 0

        print(f"Class: {label}")
        print(f"  TP Rate (Recall): {tp_rate:.4f}")
        print(f"  FP Rate         : {fp_rate:.4f}")
        print(f"  Precision       : {report[label]['precision']:.4f}")
        print(f"  F1-Score        : {report[label]['f1-score']:.4f}")
        print("-" * 20)


if __name__ == '__main__':
    train_file = 'adult/adult.data'
    test_file = 'adult/adult.test'
    output_dir = 'adult-clean'

    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    X_train, y_train, X_test, y_test = process_data(train_file, test_file)

    # Reset index to align features and labels for concatenation
    y_train = y_train.reset_index(drop=True)
    X_train = X_train.reset_index(drop=True)
    y_test = y_test.reset_index(drop=True)
    X_test = X_test.reset_index(drop=True)

    # Concatenate features and labels
    train_cleaned = pd.concat([X_train, y_train], axis=1)
    test_cleaned = pd.concat([X_test, y_test], axis=1)

    # Save the cleaned data to new CSV files
    train_cleaned.to_csv(os.path.join(output_dir, 'train_clean.csv'), index=False)
    test_cleaned.to_csv(os.path.join(output_dir, 'test_clean.csv'), index=False)

    print(f"Preprocessed data saved to '{output_dir}' directory.")
    print(f"Training data shape: {train_cleaned.shape}")
    print(f"Test data shape: {test_cleaned.shape}\n")

    # --- Model Training and Evaluation ---

    # 1. Decision Tree Classifier
    dt_classifier = DecisionTreeClassifier(random_state=42)
    evaluate_model(X_train, y_train, X_test, y_test, dt_classifier, "Decision Tree Classifier")

    # 2. Naïve Bayesian Classifier
    nb_classifier = BernoulliNB()
    evaluate_model(X_train, y_train, X_test, y_test, nb_classifier, "Naïve Bayesian Classifier")
132  part2.py  Normal file
@@ -0,0 +1,132 @@
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from typing import Tuple, List
import numpy as np


def process_data_part2(train_path: str, test_path: str) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Processes the adult dataset for part 2 requirements.
    - Removes unknown values.
    - Binarizes numerical attributes based on the mean.
    - One-hot encodes categorical attributes.
    """
    columns: List[str] = [
        'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income'
    ]

    # Load datasets
    df_train: pd.DataFrame = pd.read_csv(train_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?')
    df_test: pd.DataFrame = pd.read_csv(test_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?', skiprows=1)

    # Remove rows with any missing values
    df_train.dropna(inplace=True)
    df_test.dropna(inplace=True)

    # Separate features and target, and clean target labels
    X_train_raw = df_train.drop('income', axis=1)
    y_train = df_train['income'].str.replace('.', '', regex=False)
    X_test_raw = df_test.drop('income', axis=1)
    y_test = df_test['income'].str.replace('.', '', regex=False)

    # Identify numerical and categorical attributes
    numerical_cols = X_train_raw.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = X_train_raw.select_dtypes(exclude=np.number).columns.tolist()

    # --- Preprocessing ---

    # 1. Binarize numerical attributes
    X_train_numerical_processed = pd.DataFrame()
    X_test_numerical_processed = pd.DataFrame()

    for col in numerical_cols:
        mean_val = X_train_raw[col].mean()
        X_train_numerical_processed[col] = (X_train_raw[col] > mean_val).astype(int)
        X_test_numerical_processed[col] = (X_test_raw[col] > mean_val).astype(int)

    # 2. One-hot encode categorical attributes
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    X_train_categorical_processed = pd.DataFrame(
        encoder.fit_transform(X_train_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols)
    )
    X_test_categorical_processed = pd.DataFrame(
        encoder.transform(X_test_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols)
    )

    # Reset index to ensure concatenation works correctly
    X_train_numerical_processed.index = X_train_categorical_processed.index
    X_test_numerical_processed.index = X_test_categorical_processed.index
    y_train.index = X_train_categorical_processed.index
    y_test.index = X_test_categorical_processed.index

    # 3. Combine processed features
    X_train_processed = pd.concat([X_train_numerical_processed, X_train_categorical_processed], axis=1)
    X_test_processed = pd.concat([X_test_numerical_processed, X_test_categorical_processed], axis=1)

    return X_train_processed, y_train, X_test_processed, y_test


def run_kmeans_clustering(X_train: pd.DataFrame, k_values: List[int]):
    """
    Runs K-Means clustering for different k values and reports centroids.
    """
    print("--- K-Means Clustering ---")
    for k in k_values:
        print(f"\nRunning K-Means with k={k}...")
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X_train)

        print(f"Centroids for k={k}:")
        # Printing only the first 5 dimensions for brevity
        print(pd.DataFrame(kmeans.cluster_centers_[:, :5], columns=X_train.columns[:5]))
        print("-" * 20)


def run_knn_classification(X_train: pd.DataFrame, y_train: pd.Series, X_test: pd.DataFrame, y_test: pd.Series, k_values: List[int]):
    """
    Runs kNN classification on the last 10 test samples and reports accuracy.
    """
    print("\n--- k-Nearest Neighbors (kNN) Classification ---")

    # Use the last 10 records from the test set
    X_test_sample = X_test.tail(10)
    y_test_sample = y_test.tail(10)

    print(f"Predicting for the last {len(X_test_sample)} records of the test set.\n")

    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)

        y_pred_sample = knn.predict(X_test_sample)
        accuracy = accuracy_score(y_test_sample, y_pred_sample)

        print(f"kNN with k={k}:")
        print(f"  Prediction Accuracy: {accuracy:.2f}")
        print(f"  Predicted Labels: {y_pred_sample}")
        print(f"  Actual Labels:    {y_test_sample.values}")
        print("-" * 20)


if __name__ == '__main__':
    train_file = 'adult/adult.data'
    test_file = 'adult/adult.test'

    # Process data according to Part 2 requirements
    X_train, y_train, X_test, y_test = process_data_part2(train_file, test_file)

    print("Data processing complete.")
    print(f"Training data shape: {X_train.shape}")
    print(f"Test data shape: {X_test.shape}\n")

    # Run K-Means Clustering
    kmeans_k_values = [3, 5, 10]
    run_kmeans_clustering(X_train, kmeans_k_values)

    # Run kNN Classification
    knn_k_values = [3, 5, 10]
    run_knn_classification(X_train, y_train, X_test, y_test, knn_k_values)
107  part3.py  Normal file
@@ -0,0 +1,107 @@
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from typing import Tuple, List
import numpy as np


# This is the same data processing function from Part 2
def process_data_part2(train_path: str, test_path: str) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Processes the adult dataset for part 2 requirements.
    - Removes unknown values.
    - Binarizes numerical attributes based on the mean.
    - One-hot encodes categorical attributes.
    """
    columns: List[str] = [
        'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income'
    ]

    # Load datasets
    df_train: pd.DataFrame = pd.read_csv(train_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?')
    df_test: pd.DataFrame = pd.read_csv(test_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?', skiprows=1)

    # Remove rows with any missing values
    df_train.dropna(inplace=True)
    df_test.dropna(inplace=True)

    # Separate features and target, and clean target labels
    X_train_raw = df_train.drop('income', axis=1)
    y_train = df_train['income'].str.replace('.', '', regex=False)
    X_test_raw = df_test.drop('income', axis=1)
    y_test = df_test['income'].str.replace('.', '', regex=False)

    # Identify numerical and categorical attributes
    numerical_cols = X_train_raw.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = X_train_raw.select_dtypes(exclude=np.number).columns.tolist()

    # --- Preprocessing ---

    # 1. Binarize numerical attributes
    X_train_numerical_processed = pd.DataFrame()
    X_test_numerical_processed = pd.DataFrame()

    for col in numerical_cols:
        mean_val = X_train_raw[col].mean()
        X_train_numerical_processed[col] = (X_train_raw[col] > mean_val).astype(int)
        X_test_numerical_processed[col] = (X_test_raw[col] > mean_val).astype(int)

    # 2. One-hot encode categorical attributes
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    X_train_categorical_processed = pd.DataFrame(
        encoder.fit_transform(X_train_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols),
        index=X_train_raw.index
    )
    X_test_categorical_processed = pd.DataFrame(
        encoder.transform(X_test_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols),
        index=X_test_raw.index
    )

    # 3. Combine processed features
    X_train_processed = pd.concat([X_train_numerical_processed, X_train_categorical_processed], axis=1)
    X_test_processed = pd.concat([X_test_numerical_processed, X_test_categorical_processed], axis=1)

    # Align y labels with the processed X dataframes
    y_train = y_train.loc[X_train_processed.index]
    y_test = y_test.loc[X_test_processed.index]

    return X_train_processed, y_train, X_test_processed, y_test


if __name__ == '__main__':
    train_file = 'adult/adult.data'
    test_file = 'adult/adult.test'

    # Process data using the function from Part 2
    X_train, y_train, X_test, y_test = process_data_part2(train_file, test_file)

    print("Data processing complete.")
    print(f"Training data shape: {X_train.shape}")
    print(f"Test data shape: {X_test.shape}\n")

    # --- SVM Classifier ---
    print("--- Support Vector Machine (SVM) Classifier ---")

    # Initialize the SVM classifier. A linear kernel is often a good starting point.
    # Note: SVM training can be slow on large datasets; for a quicker demonstration
    # you could train on a sample, e.g. X_train.sample(n=5000, random_state=42).
    # Here the full training set is used.
    print("Training the SVM classifier... (This may take a few minutes)")
    svm_classifier = SVC(kernel='linear', random_state=42)

    # Train the model on the full training data
    svm_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    print("Making predictions on the test data...")
    y_pred = svm_classifier.predict(X_test)

    # Calculate and report the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    print(f"\nSVM Classifier Accuracy on Test Data: {accuracy:.4f}")
105  part4.py  Normal file
@@ -0,0 +1,105 @@
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from typing import Tuple, List
import numpy as np


# This is the same data processing function from Part 2
def process_data_part2(train_path: str, test_path: str) -> Tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Processes the adult dataset for part 2 requirements.
    - Removes unknown values.
    - Binarizes numerical attributes based on the mean.
    - One-hot encodes categorical attributes.
    """
    columns: List[str] = [
        'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income'
    ]

    # Load datasets
    df_train: pd.DataFrame = pd.read_csv(train_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?')
    df_test: pd.DataFrame = pd.read_csv(test_path, header=None, names=columns, sep=r',\s*', engine='python', na_values='?', skiprows=1)

    # Remove rows with any missing values
    df_train.dropna(inplace=True)
    df_test.dropna(inplace=True)

    # Separate features and target, and clean target labels
    X_train_raw = df_train.drop('income', axis=1)
    y_train = df_train['income'].str.replace('.', '', regex=False)
    X_test_raw = df_test.drop('income', axis=1)
    y_test = df_test['income'].str.replace('.', '', regex=False)

    # Identify numerical and categorical attributes
    numerical_cols = X_train_raw.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = X_train_raw.select_dtypes(exclude=np.number).columns.tolist()

    # --- Preprocessing ---

    # 1. Binarize numerical attributes
    X_train_numerical_processed = pd.DataFrame()
    X_test_numerical_processed = pd.DataFrame()

    for col in numerical_cols:
        mean_val = X_train_raw[col].mean()
        X_train_numerical_processed[col] = (X_train_raw[col] > mean_val).astype(int)
        X_test_numerical_processed[col] = (X_test_raw[col] > mean_val).astype(int)

    # 2. One-hot encode categorical attributes
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    X_train_categorical_processed = pd.DataFrame(
        encoder.fit_transform(X_train_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols),
        index=X_train_raw.index
    )
    X_test_categorical_processed = pd.DataFrame(
        encoder.transform(X_test_raw[categorical_cols]),
        columns=encoder.get_feature_names_out(categorical_cols),
        index=X_test_raw.index
    )

    # 3. Combine processed features
    X_train_processed = pd.concat([X_train_numerical_processed, X_train_categorical_processed], axis=1)
    X_test_processed = pd.concat([X_test_numerical_processed, X_test_categorical_processed], axis=1)

    # Align y labels with the processed X dataframes
    y_train = y_train.loc[X_train_processed.index]
    y_test = y_test.loc[X_test_processed.index]

    return X_train_processed, y_train, X_test_processed, y_test


if __name__ == '__main__':
    train_file = 'adult/adult.data'
    test_file = 'adult/adult.test'

    # Process data using the function from Part 2
    X_train, y_train, X_test, y_test = process_data_part2(train_file, test_file)

    print("Data processing complete.")
    print(f"Training data shape: {X_train.shape}")
    print(f"Test data shape: {X_test.shape}\n")

    # --- Neural Network Classifier ---
    print("--- Neural Network (MLP) Classifier ---")

    # Initialize the Multi-layer Perceptron classifier.
    # hidden_layer_sizes=(100,) means one hidden layer with 100 neurons.
    # max_iter=500 to give the model enough iterations to converge.
    # random_state=42 for reproducibility.
    nn_classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)

    print("Training the Neural Network classifier...")
    nn_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    print("Making predictions on the test data...")
    y_pred = nn_classifier.predict(X_test)

    # Calculate and report the accuracy
    accuracy = accuracy_score(y_test, y_pred)

    print(f"\nNeural Network Classifier Accuracy on Test Data: {accuracy:.4f}")
2  requirements.txt  Normal file
@@ -0,0 +1,2 @@
pandas
scikit-learn>=1.2  # OneHotEncoder's sparse_output argument requires 1.2+