Big Data Analysis Engineer Practical Exam, Type 2 [Classification]: Titanic

Follow these steps as written to aim for the full 40 points on Type 2!

 

https://www.kaggle.com/code/agileteam/t2-1-titanic-simple-baseline

 


 

 

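# Load and inspect the data
import pandas as pd

# X_train, X_test, and y_train are assumed to be loaded already as DataFrames
# (for example with pd.read_csv on the files provided in the exam environment).
print(X_train.info())
print(X_test.info())
print(y_train.info())

print(X_train.isnull().sum())
print(X_test.isnull().sum())
print(y_train.isnull().sum())
'''
X_train, X_test: missing values present (Age, Cabin, Embarked)
Columns to drop from X_train and X_test: PassengerId, Name, Cabin
Columns to label-encode: Sex, Ticket, Embarked
'''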

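# Preprocessing: fill missing values with the mode (most frequent value)
# Assign the result back instead of calling fillna(inplace = True) on a column,
# which recent pandas versions warn about and may silently ignore.
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].mode()[0])
X_train['Embarked'] = X_train['Embarked'].fillna(X_train['Embarked'].mode()[0])

X_test['Age'] = X_test['Age'].fillna(X_test['Age'].mode()[0])
X_test['Embarked'] = X_test['Embarked'].fillna(X_test['Embarked'].mode()[0])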

print(X_train.info())
print(X_test.info())

# Preprocessing: label encoding

from sklearn.preprocessing import LabelEncoder

# Fit each encoder on the train and test values together so that the same
# category always maps to the same integer in both sets. Calling fit_transform
# separately on X_test can assign different codes (and Ticket values seen only
# in X_test would otherwise be unknown to an encoder fit on X_train alone).
for col in ['Sex', 'Ticket', 'Embarked']:
    le = LabelEncoder()
    le.fit(pd.concat([X_train[col], X_test[col]]))
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])

print(X_train.info())
print(X_test.info())

# Split the data

from sklearn.model_selection import train_test_split

X = X_train.drop(columns = ['PassengerId', 'Name', 'Cabin'])
y = y_train['Survived']

# Hold out 20% of the training data for validation
x_train, x_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)

# Modeling and training
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators = 30, max_depth = 10, random_state = 11)
rfc.fit(x_train, Y_train)
pred1 = rfc.predict(x_test)

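# Evaluate performance
from sklearn.metrics import roc_auc_score, accuracy_score
# ROC AUC is computed from the predicted probability of class 1 (Survived = 1),
# accuracy from the hard 0/1 predictions.
proba1 = rfc.predict_proba(x_test)[:, 1]
roc = roc_auc_score(Y_test, proba1)
acc = accuracy_score(Y_test, pred1)

print(roc, acc)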

# Predict on the test data
test_X = X_test.drop(columns = ['PassengerId', 'Name', 'Cabin'])
pred2 = rfc.predict(test_X)

# Save the result file for submission and check it
pd.DataFrame({'PassengerId': X_test['PassengerId'], 'pred': pred2}).to_csv('result.csv', index = False)
result = pd.read_csv('result.csv')

print(result)
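
A note on the evaluation step: ROC AUC measures how well the model ranks positives above negatives, so it is usually computed from predicted probabilities (`predict_proba`) rather than from hard 0/1 labels. The following is a minimal sketch with made-up labels and scores (not Titanic results); the variable names are illustrative only. It shows that thresholding the scores first can lower the reported AUC.

```python
from sklearn.metrics import roc_auc_score

# Toy values chosen for illustration only (not taken from the Titanic data).
y_true = [0, 0, 1, 1, 1]
proba = [0.2, 0.45, 0.3, 0.48, 0.9]       # hypothetical predicted P(class 1)
labels = [int(p >= 0.5) for p in proba]   # hard labels at a 0.5 cutoff

auc_from_proba = roc_auc_score(y_true, proba)    # uses the full ranking
auc_from_labels = roc_auc_score(y_true, labels)  # ranking collapsed to 0/1

print(auc_from_proba, auc_from_labels)
```

In this toy case the probability-based AUC comes out higher, because the cutoff turns several correctly ordered pairs into ties.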

 


Preprocess everything at once with loops!

 

1. Handling missing values

# df is any DataFrame to clean, e.g. X_train or X_test
for column in df.columns:
	mode_value = df[column].mode()[0]
	df[column] = df[column].fillna(mode_value)
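
To see what the loop does, here is a small self-contained sketch on a made-up frame. The column names mirror the Titanic data, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train / X_test.
df = pd.DataFrame({
    'Age': [22.0, None, 22.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
})

for column in df.columns:
    mode_value = df[column].mode()[0]           # most frequent non-missing value
    df[column] = df[column].fillna(mode_value)  # assign back instead of inplace

print(df)
```

The mode is a simple default that works for both numeric and text columns. For a numeric column such as Age, the median is a common alternative.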

 

2. Label encoding

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Encode every text (object) column; the encoder is refit for each column
for column in df.select_dtypes(include = ['object']).columns:
	df[column] = le.fit_transform(df[column])
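
When this loop is run separately on the training and test frames, each call to `fit_transform` builds its own mapping, so the same category can end up with different integers in the two frames. One way around this, sketched below on hypothetical toy frames, is to fit each encoder on the combined values and then transform both frames. The frame names and values here are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames standing in for X_train / X_test.
train = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'C']})
test = pd.DataFrame({'Sex': ['female', 'male'], 'Embarked': ['Q', 'S']})

for column in ['Sex', 'Embarked']:
    le = LabelEncoder()
    # Learn every category that appears in either frame, once.
    le.fit(pd.concat([train[column], test[column]]))
    train[column] = le.transform(train[column])
    test[column] = le.transform(test[column])

print(train)
print(test)
```

Here 'S' is encoded as 2 in both frames. Had the test frame been encoded with its own `fit_transform`, 'S' would have become 1 there, since that frame only contains 'Q' and 'S'.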