Ch12~13. Advanced pandas & Python Modeling Libraries

In data warehousing, it is common practice to keep the distinct values in a dimension table and to store integer keys that reference that table.

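import numpy as np
import pandas as pd

# Integer keys that reference rows of the dimension table
values = pd.Series([0,1,0,0]*2)
# Dimension table holding the distinct values
dim = pd.Series(['apple','orange'])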

values 
>>
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

dim.take(values)
>>
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

-- The take method recovers the original strings stored in the Series

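Categorical / dictionary-encoded representation : values represented as integers
Categories, dictionary, or levels : the array holding the distinct values (= the categories)
Category codes (= codes) : the integer values that refer to the categories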
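The terms above can be made concrete with a small sketch. It assumes the same apple/orange data as the take example and uses pd.Categorical.from_codes to assemble a categorical from its two parts; the variable names are illustrative only.

```python
import pandas as pd

# The distinct values (categories) and the integer keys (codes).
categories = ['apple', 'orange']
codes = [0, 1, 0, 0] * 2

# Build a Categorical directly from the two pieces.
fruit_cat = pd.Categorical.from_codes(codes, categories=categories)

print(fruit_cat.categories)  # the distinct values
print(fruit_cat.codes)       # the integer codes

# Iterating over the Categorical decodes each code back to its string.
decoded = list(fruit_cat)
print(decoded)
```

This is the same mapping that dim.take(values) performs by hand: each code is an index into the array of categories.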

Using the categorical representation can yield performance improvements in analytics work.
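One aspect of this that is easy to measure is memory footprint. The sketch below compares a string column with its categorical version; the data (100,000 labels drawn from four strings) is made up for illustration, and the exact byte counts depend on the pandas version and the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 labels drawn from four distinct strings.
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(['foo', 'bar', 'baz', 'qux'], size=100_000))
cat_labels = labels.astype('category')

# deep=True also counts the memory held by the strings themselves.
str_bytes = labels.memory_usage(deep=True)
cat_bytes = cat_labels.memory_usage(deep=True)
print(str_bytes, cat_bytes)
```

The categorical version stores one small integer code per row plus a single copy of each distinct string, which is why it tends to be much smaller when there are few distinct values.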

Categorical in pandas

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
n = len(fruits)

df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(n),
                   'count': np.random.randint(3, 15, size=n),
                   'weight': np.random.uniform(0, 4, size=n)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
                 
df
>> 
   basket_id   fruit  count    weight
0          0   apple     13  2.270653
1          1  orange     12  2.199684
2          2   apple      7  1.633151
3          3   apple     11  3.259698
4          4   apple     13  2.032409
5          5  orange     10  0.481295
6          6   apple      7  2.822669
7          7   apple      6  1.411839

fruit_cat=df['fruit'].astype('category')
fruit_cat
>>
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

-- The values of fruit_cat are not a NumPy array
-- but an instance of pandas.Categorical
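A short sketch of how one might inspect that underlying object. It rebuilds the fruit column on its own so that it runs standalone; the name c is just illustrative.

```python
import pandas as pd

fruits = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
fruit_cat = fruits.astype('category')

# .values returns the underlying pandas.Categorical, not a NumPy array.
c = fruit_cat.values
print(type(c))

print(c.categories)  # the distinct values
print(c.codes)       # [0 1 0 0 0 1 0 0]
```

The same two attributes are also available on the Series through the cat accessor, as fruit_cat.cat.categories and fruit_cat.cat.codes.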

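Converting to categorical does not guarantee any particular ordering of the categories unless you specify one explicitly. Categoricals support statistics and other computations. When you run many different analyses on a particular dataset, converting to categorical alone can improve performance.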
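To illustrate the point about ordering, here is a minimal sketch that specifies an explicit order through pd.CategoricalDtype. The size labels and the order small < medium < large are assumptions made up for this example.

```python
import pandas as pd

sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Without an explicit order, the resulting categorical is unordered.
unordered = sizes.astype('category')

# Hypothetical ordering for this example: small < medium < large.
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'],
                                ordered=True)
ordered = sizes.astype(size_type)

print(unordered.cat.ordered)         # False
print(ordered.cat.ordered)           # True
print(ordered.min(), ordered.max())  # small large
print(ordered.value_counts())        # counts per category
```

With an ordered categorical, min, max, and sorting follow the declared category order rather than alphabetical order.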

Method chaining techniques include the assign and pipe methods.
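A minimal sketch of both methods in one chain. The DataFrame contents and the keep_heavy helper are hypothetical, introduced only to show how the calls fit together.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'orange', 'apple', 'apple'],
                   'count': [13, 12, 7, 11],
                   'weight': [2.0, 1.5, 3.0, 0.5]})

def keep_heavy(frame, min_weight):
    """Hypothetical helper: keep rows whose weight is at least min_weight."""
    return frame[frame['weight'] >= min_weight]

result = (df
          # assign returns a new DataFrame with the derived column added
          .assign(total_weight=lambda x: x['count'] * x['weight'])
          # pipe passes the intermediate DataFrame to a user-defined function
          .pipe(keep_heavy, min_weight=1.0))
print(result)
```

Because assign returns a new object instead of modifying df in place, the original DataFrame is left unchanged and no temporary variables are needed between steps.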


