pandas_tutorial_02

  • 라이브러리 불러오기
1
2
import pandas as pd
print(pd.__version__)
1.3.5

구글 드라이브 연동

  • 구글 드라이브 → colab notebook → 새 폴더 생성 : data → 슬랙에서 다운 받은 lemonade.csv 파일을 올린다 -> 다음 코드를 실행
1
2
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Mounted at ..drive 가 출력되었으므로 성공

현재 좌측에 폴더 그림 -> drive -> mydrive -> Colab Notebooks -> data -> supermarket_sales.csv를 찾아서 우클릭 -> 경로 복사 -> 다음 코드에 붙여넣어 사용

1
2
3
DATA_PATH = '/content/drive/MyDrive/Colab Notebooks/data/supermarket_sales.csv'
sales = pd.read_csv(DATA_PATH)
sales

Invoice ID Branch City Customer type Gender Product line Unit price Quantity Date Time Payment
0 750-67-8428 A Yangon Member Female Health and beauty 74.69 7 1/5/2019 13:08 Ewallet
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.28 5 3/8/2019 10:29 Cash
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.33 7 3/3/2019 13:23 Credit card
3 123-19-1176 A Yangon Member Male Health and beauty 58.22 8 1/27/2019 20:33 Ewallet
4 373-73-7910 A Yangon Normal Male Sports and travel 86.31 7 2/8/2019 10:37 Ewallet
... ... ... ... ... ... ... ... ... ... ... ...
995 233-67-5758 C Naypyitaw Normal Male Health and beauty 40.35 1 1/29/2019 13:46 Ewallet
996 303-96-2227 B Mandalay Normal Female Home and lifestyle 97.38 10 3/2/2019 17:16 Ewallet
997 727-02-1313 A Yangon Member Male Food and beverages 31.84 1 2/9/2019 13:22 Cash
998 347-56-2442 A Yangon Normal Male Home and lifestyle 65.82 1 2/22/2019 15:33 Cash
999 849-09-3807 A Yangon Member Female Fashion accessories 88.34 7 2/18/2019 13:28 Cash

1000 rows × 11 columns

  <script>
    const buttonEl =
      document.querySelector('#df-8a1e46d8-83ea-49d2-a98d-cf274f10b34d button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-8a1e46d8-83ea-49d2-a98d-cf274f10b34d');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Invoice ID     1000 non-null   object 
 1   Branch         1000 non-null   object 
 2   City           1000 non-null   object 
 3   Customer type  1000 non-null   object 
 4   Gender         1000 non-null   object 
 5   Product line   1000 non-null   object 
 6   Unit price     1000 non-null   float64
 7   Quantity       1000 non-null   int64  
 8   Date           1000 non-null   object 
 9   Time           1000 non-null   object 
 10  Payment        1000 non-null   object 
dtypes: float64(1), int64(1), object(9)
memory usage: 86.1+ KB

Group by

  • (동의어) 집계함수를 배운다.
1
2
# 여러가지 시도해보면서 정보를 파악해보자
sales['Invoice ID'].value_counts()
750-67-8428    1
642-61-4706    1
816-72-8853    1
491-38-3499    1
322-02-2271    1
              ..
633-09-3463    1
374-17-3652    1
378-07-7001    1
433-75-6987    1
849-09-3807    1
Name: Invoice ID, Length: 1000, dtype: int64
1
2
# 여러가지 시도해보면서 정보를 파악해보자
sales.groupby('Customer type')['Quantity'].sum()
Customer type
Member    2785
Normal    2725
Name: Quantity, dtype: int64
1
sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].sum()
Customer type  Branch  Payment    
Member         A       Cash           308
                       Credit card    282
                       Ewallet        374
               B       Cash           284
                       Credit card    371
                       Ewallet        269
               C       Cash           293
                       Credit card    349
                       Ewallet        255
Normal         A       Cash           264
                       Credit card    298
                       Ewallet        333
               B       Cash           344
                       Credit card    228
                       Ewallet        324
               C       Cash           403
                       Credit card    194
                       Ewallet        337
Name: Quantity, dtype: int64
  • data type은 Series 이다.
1
print(type(sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].sum()))
<class 'pandas.core.series.Series'>
  • 검색 키워드를 잘 선택하는게 중요하다.
1
sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].agg(['sum', 'mean'])

sum mean
Customer type Branch Payment
Member A Cash 308 5.500000
Credit card 282 5.755102
Ewallet 374 6.032258
B Cash 284 5.358491
Credit card 371 5.888889
Ewallet 269 5.489796
C Cash 293 4.966102
Credit card 349 5.816667
Ewallet 255 5.100000
Normal A Cash 264 4.888889
Credit card 298 5.418182
Ewallet 333 5.203125
B Cash 344 6.035088
Credit card 228 4.956522
Ewallet 324 5.062500
C Cash 403 6.200000
Credit card 194 5.105263
Ewallet 337 6.017857

  <script>
    const buttonEl =
      document.querySelector('#df-9f19e00c-ea81-404c-b289-c9ddb325aeaa button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-9f19e00c-ea81-404c-b289-c9ddb325aeaa');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
print(type(sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].agg(['sum', 'mean'])))
<class 'pandas.core.frame.DataFrame'>
1
sales.groupby(['Customer type', 'Branch', 'Payment'], as_index=False)['Quantity'].agg(['sum', 'mean'])

sum mean
Customer type Branch Payment
Member A Cash 308 5.500000
Credit card 282 5.755102
Ewallet 374 6.032258
B Cash 284 5.358491
Credit card 371 5.888889
Ewallet 269 5.489796
C Cash 293 4.966102
Credit card 349 5.816667
Ewallet 255 5.100000
Normal A Cash 264 4.888889
Credit card 298 5.418182
Ewallet 333 5.203125
B Cash 344 6.035088
Credit card 228 4.956522
Ewallet 324 5.062500
C Cash 403 6.200000
Credit card 194 5.105263
Ewallet 337 6.017857

  <script>
    const buttonEl =
      document.querySelector('#df-f56f9b0d-43e2-4ba0-8abf-56fa96c5d20f button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-f56f9b0d-43e2-4ba0-8abf-56fa96c5d20f');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

결측치 다루기

  • 결측치 데이터 생성
    • 임의로 여러가지 생성해보자 (숙달 과정)
1
2
3
4
5
6
7
8
9
10
11
import pandas as pd
import numpy as np

dict_01 = {
'Score_A' : [80, 90, np.nan, 80],
'Score_B' : [30, 45, np.nan, np.nan],
'Score_C' : [np.nan, 50, 80, 90],
}

df = pd.DataFrame(dict_01)
df

Score_A Score_B Score_C
0 80.0 30.0 NaN
1 90.0 45.0 50.0
2 NaN NaN 80.0
3 80.0 NaN 90.0

  <script>
    const buttonEl =
      document.querySelector('#df-6c9ddc3e-23cb-46c2-bcca-8e0adaab788b button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-6c9ddc3e-23cb-46c2-bcca-8e0adaab788b');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
  • True = 숫자 1로 인식

  • False = 숫자 0으로 인식

  • 결측치 (Nan) 개수 세기

1
df.isnull().sum()
Score_A    1
Score_B    2
Score_C    1
dtype: int64
  • 결측치를 다른 것으로 채우기
1
df.fillna("0")

Score_A Score_B Score_C
0 80.0 30.0 0
1 90.0 45.0 50.0
2 0 0 80.0
3 80.0 0 90.0

  <script>
    const buttonEl =
      document.querySelector('#df-ce166771-c2da-430d-aa22-8b7cac811d94 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-ce166771-c2da-430d-aa22-8b7cac811d94');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
# 바로 윗칸의 데이터로 채우기
df.fillna(method="pad")

Score_A Score_B Score_C
0 80.0 30.0 NaN
1 90.0 45.0 50.0
2 90.0 45.0 80.0
3 80.0 45.0 90.0

  <script>
    const buttonEl =
      document.querySelector('#df-14c34a17-7745-4466-a779-62f00b5030de button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-14c34a17-7745-4466-a779-62f00b5030de');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
2
3
4
5
6
7
dict_01 = {
"성별" : ["남자", "여자", np.nan, "남자"],
"Salary" : [30, 45, 90, 70],
}

df = pd.DataFrame(dict_01)
df

성별 Salary
0 남자 30
1 여자 45
2 NaN 90
3 남자 70

  <script>
    const buttonEl =
      document.querySelector('#df-54d9e838-5824-4cb9-9f0f-8291411d9270 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-54d9e838-5824-4cb9-9f0f-8291411d9270');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
1
df['성별'].fillna("성별 없음")
0       남자
1       여자
2    성별 없음
3       남자
Name: 성별, dtype: object
  • 결측치

–> 문자열 타입이랑 / 숫자 타입이랑 접근 방법이 다름
–> 문자열(빈도 –> 가장 많이 나타나는 문자열 넣어주기!, 최빈값)
–> 숫자열(평균, 최대, 최소, 중간, 기타 등등..)

1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd
import numpy as np

dict_01 = {
'Score_A' : [80, 90, np.nan, 80],
'Score_B' : [30, 45, np.nan, np.nan],
'Score_C' : [np.nan, 50, 80, 90],
'Score_D' : [50, 30, 80, 60],
}

df = pd.DataFrame(dict_01)
df

Score_A Score_B Score_C Score_D
0 80.0 30.0 NaN 50
1 90.0 45.0 50.0 30
2 NaN NaN 80.0 80
3 80.0 NaN 90.0 60

  <script>
    const buttonEl =
      document.querySelector('#df-a1c2a0ad-c902-4c13-8d35-ae4931ac7c3d button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-a1c2a0ad-c902-4c13-8d35-ae4931ac7c3d');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
  • 결측치가 있을 때 열을 지운다.
    • axis = 1 -> columns
1
df.dropna(axis = 1)

Score_D
0 50
1 30
2 80
3 60

  <script>
    const buttonEl =
      document.querySelector('#df-a9658aae-24a1-43bd-a6fe-73b30c752e90 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-a9658aae-24a1-43bd-a6fe-73b30c752e90');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
  • 결측치가 있을 때 행을 지운다.
    • axis = 0 -> index
1
df.dropna(axis = 0)

Score_A Score_B Score_C Score_D
1 90.0 45.0 50.0 30

  <script>
    const buttonEl =
      document.querySelector('#df-1359a925-a759-4d30-9b57-10abcaf3af1a button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-1359a925-a759-4d30-9b57-10abcaf3af1a');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

이상치

1
sales

Invoice ID Branch City Customer type Gender Product line Unit price Quantity Date Time Payment
0 750-67-8428 A Yangon Member Female Health and beauty 74.69 7 1/5/2019 13:08 Ewallet
1 226-31-3081 C Naypyitaw Normal Female Electronic accessories 15.28 5 3/8/2019 10:29 Cash
2 631-41-3108 A Yangon Normal Male Home and lifestyle 46.33 7 3/3/2019 13:23 Credit card
3 123-19-1176 A Yangon Member Male Health and beauty 58.22 8 1/27/2019 20:33 Ewallet
4 373-73-7910 A Yangon Normal Male Sports and travel 86.31 7 2/8/2019 10:37 Ewallet
... ... ... ... ... ... ... ... ... ... ... ...
995 233-67-5758 C Naypyitaw Normal Male Health and beauty 40.35 1 1/29/2019 13:46 Ewallet
996 303-96-2227 B Mandalay Normal Female Home and lifestyle 97.38 10 3/2/2019 17:16 Ewallet
997 727-02-1313 A Yangon Member Male Food and beverages 31.84 1 2/9/2019 13:22 Cash
998 347-56-2442 A Yangon Normal Male Home and lifestyle 65.82 1 2/22/2019 15:33 Cash
999 849-09-3807 A Yangon Member Female Fashion accessories 88.34 7 2/18/2019 13:28 Cash

1000 rows × 11 columns

  <script>
    const buttonEl =
      document.querySelector('#df-04da7866-a736-4456-8d77-a7760df771c5 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-04da7866-a736-4456-8d77-a7760df771c5');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
  • 일반적인 통계적 공식

  • IQR - 박스플롯 - 사분위수

  • Q0(0), Q1(25%), Q2(50%), Q3(75%), Q4(100%)

  • 이상치의 하한 경계값 : Q1 - 1.5 * (Q3-Q1)

  • 이상치의 상한 경계값 : Q3 + 1.5 * (Q3-Q1)

  • 도메인 (각 비즈니스 영역, 미래 일자리) 에서 바라보는 이상치 기준 (관습)

1
sales[['Unit price']]. describe()

Unit price
count 1000.000000
mean 55.672130
std 26.494628
min 10.080000
25% 32.875000
50% 55.230000
75% 77.935000
max 99.960000

  <script>
    const buttonEl =
      document.querySelector('#df-6bdae016-4a2d-4f55-9e51-5f68e6af7217 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-6bdae016-4a2d-4f55-9e51-5f68e6af7217');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>
  • 이상치의 하한 경계값 : Q1 - 1.5 * (Q3-Q1)
  • 이런 공식은 통계적으로 타당하지만 그 외에도 이상치인지 판단할 방법이 있다.
1
2
3
4
5
6
7
8
Q1 = sales['Unit price'].quantile(0.25)
Q3 = sales['Unit price'].quantile(0.75)

# Q1보다 낮은 값을 이상치로 간주
outliers_q1 = (sales['Unit price'] < Q1)

# Q1보다 높은 값을 이상치로 간주
outliers_q3 = (sales['Unit price'] > Q3)
  • 이 코드는 특히 중요하다
1
print(sales['Unit price'][~(outliers_q1 | outliers_q3)])
0      74.69
2      46.33
3      58.22
6      68.84
7      73.56
       ...  
991    76.60
992    58.03
994    60.95
995    40.35
998    65.82
Name: Unit price, Length: 500, dtype: float64
Author

minkuen

Posted on

2022-03-24

Updated on

2022-03-24

Licensed under

You need to set install_url to use ShareThis. Please set it in _config.yml.
You forgot to set the business or currency_code for Paypal. Please set it in _config.yml.

Comments

You forgot to set the shortname for Disqus. Please set it in _config.yml.