

Type 2 [Classification] Adult Census Income (categorical)

Follow these steps exactly and you can get the full 40 points on Type 2!

https://www.kaggle.com/code/agileteam/t2-3-adult-census-income-tutorial

 


# X_train, y_train, X_test (X_test is the test data for submission)
import pandas as pd

# Inspect the data
# print(X_train.head())
# print(y_train.head())

# print(X_train.info())
# print(X_test.info())
# print(y_train.info())

# print(X_train.isnull().sum())
# print(X_test.isnull().sum())
# print(y_train.isnull().sum())
'''
Missing values (X_train, X_test): workclass, occupation, native.country
-> fill with mode()[0]
Columns to label-encode (X_train, X_test): workclass, education, marital.status, occupation, relationship, race, sex, native.country
Column to label-encode (y_train): income

Independent variables: every column except id
Dependent variable: income
'''
# Preprocessing: fill missing values with the mode
# (assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas)
X_train['workclass'] = X_train['workclass'].fillna(X_train['workclass'].mode()[0])
X_train['occupation'] = X_train['occupation'].fillna(X_train['occupation'].mode()[0])
X_train['native.country'] = X_train['native.country'].fillna(X_train['native.country'].mode()[0])

X_test['workclass'] = X_test['workclass'].fillna(X_test['workclass'].mode()[0])
X_test['occupation'] = X_test['occupation'].fillna(X_test['occupation'].mode()[0])
X_test['native.country'] = X_test['native.country'].fillna(X_test['native.country'].mode()[0])

# print(X_train.isnull().sum())
# print(X_test.isnull().sum())
# print(y_train.isnull().sum())

# Preprocessing: label encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

cat_cols = ['workclass', 'education', 'marital.status', 'occupation',
            'relationship', 'race', 'sex', 'native.country']

# Fit on the train and test categories together so that each category
# gets the same integer code in both sets (separate fit_transform calls
# can assign different codes when the category sets differ)
for col in cat_cols:
    le.fit(pd.concat([X_train[col], X_test[col]]))
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])

y_train['income'] = le.fit_transform(y_train['income'])

# print(X_train.info())
# print(X_test.info())
# print(y_train.info())

# Split the data (hold out 20% of the training data for a local check)
from sklearn.model_selection import train_test_split
X = X_train.drop(columns = ['id'])
y = y_train['income']

x_train, x_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 11)

# Modeling, training, and a hold-out check
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 30, max_depth = 10, random_state = 11)

rfc.fit(x_train, Y_train)
pred1 = rfc.predict(x_test)

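# Evaluate on the hold-out split
from sklearn.metrics import accuracy_score, roc_auc_score
acc = accuracy_score(Y_test, pred1)
# ROC AUC is computed from the predicted probability of class 1, not from hard labels
proba1 = rfc.predict_proba(x_test)[:, 1]
roc = roc_auc_score(Y_test, proba1)

print(acc, roc)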

# Predict on the submission test data
test_X = X_test.drop(columns = ['id'])
pred2 = rfc.predict(test_X)

# Create the result file and check it
pd.DataFrame({'id': X_test['id'], 'pred': pred2}).to_csv('result.csv', index= False)
result = pd.read_csv('result.csv')
print(result)
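
The hold-out check in the script reports accuracy and ROC AUC. The sketch below uses synthetic data from make_classification (an assumption made for illustration, not the Adult Census data) to show that ROC AUC computed from hard 0/1 predictions and ROC AUC computed from predict_proba scores are generally two different numbers. The names X_demo, y_demo, and model are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the real features (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=11)
x_tr, x_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=11)

model = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=11)
model.fit(x_tr, y_tr)

labels = model.predict(x_va)              # hard 0/1 predictions
scores = model.predict_proba(x_va)[:, 1]  # predicted probability of class 1

auc_from_labels = roc_auc_score(y_va, labels)
auc_from_scores = roc_auc_score(y_va, scores)
print(auc_from_labels, auc_from_scores)
```

Hard labels reduce the ROC curve to a single threshold, so when a task is scored on ROC AUC it is usually the probability column that should be evaluated and submitted.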

 


Preprocess everything at once with loops!

 

1. Handling missing values

for column in df.columns:
    mode_value = df[column].mode()[0]
    df[column] = df[column].fillna(mode_value)  # assign back instead of inplace=True
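
As a small worked example of the mode-fill loop, the sketch below runs it on a toy DataFrame. The values and the name df_demo are made up for illustration. It also restricts the loop to columns that actually contain missing values, which is one possible variation and not something the loop above requires.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with missing entries in two categorical columns
df_demo = pd.DataFrame({
    'workclass': ['Private', np.nan, 'Private', 'State-gov'],
    'occupation': ['Sales', 'Sales', np.nan, 'Tech-support'],
    'age': [39, 50, 38, 53],
})

# Only loop over columns that actually contain missing values
for column in df_demo.columns[df_demo.isnull().any()]:
    mode_value = df_demo[column].mode()[0]                # most frequent value
    df_demo[column] = df_demo[column].fillna(mode_value)  # assign the filled column back

print(df_demo)
```

Assigning the result back to the column avoids relying on inplace=True through a column selection, which recent pandas versions warn about.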

 

2. Label encoding

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for column in df.select_dtypes(include=['object']).columns:
    df[column] = le.fit_transform(df[column])
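
When the same loop is applied to a train set and a test set, fitting the encoder separately on each can give one category two different codes. The sketch below shows one way to avoid that by fitting on both sets together. The frames train_demo and test_demo and their values are hypothetical, and dtype=object is set explicitly so that select_dtypes(include=['object']) picks the columns up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames (hypothetical values); 'Never-married' appears only in the test set
train_demo = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                           'marital.status': ['Married', 'Divorced', 'Married']},
                          dtype=object)
test_demo = pd.DataFrame({'sex': ['Female', 'Male'],
                          'marital.status': ['Never-married', 'Married']},
                         dtype=object)

for column in train_demo.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    # Fit on the categories of both sets so each value gets one code everywhere
    le.fit(pd.concat([train_demo[column], test_demo[column]]))
    train_demo[column] = le.transform(train_demo[column])
    test_demo[column] = le.transform(test_demo[column])

print(train_demo)
print(test_demo)
```

LabelEncoder assigns codes in sorted order of the fitted classes, so 'Married' maps to the same integer in both frames here even though the test set contains a category the train set lacks.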