Posted 2022-05-01 | Updated 2022-05-04 | minkuen | python / library / pyspark | 7 minutes read (About 1109 words)

Spark ML

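This post tries out machine learning with Spark.
The code is written for the experience, regardless of how practical it is.
For machine learning (ML), the main focus of study should be Scikit-Learn.
For deep learning (DL), the focus should be on TensorFlow and PyTorch.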

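Preparation

Open VS Code from Git Bash and open a terminal.

Right-click the desktop : git bash here

→ cd pyspk_project

→ code .

→ Git Bash terminal

pyspark_ml practice (1)

Activate the virtual environment, then create the folder and file.

→ source venv/Scripts/activate

→ Create folder : chapter03_ml

→ cd chapter03_ml

Download data.zip from Slack.
Unzip it and copy the contents into the chapter03_ml folder.

→ Create file : step01_regression.py

→ Write the code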

from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

# Create the Spark session
spark = SparkSession.builder.appName("DecisionTree").getOrCreate()

# Load the data
# The schema is inferred here, so the explicit StructType step is skipped
data = spark.read.option("header", "true").option("inferSchema", "true").csv("data/realestate.csv")

# Assemble the feature columns into a single vector column
assembler = VectorAssembler().setInputCols(['HouseAge', 'DistanceToMRT', 'NumberConvenienceStores']).setOutputCol("features")

# Keep only the target column and the feature vector
df = assembler.transform(data).select("PriceofUnitArea", "features")

# Split into training and test sets
trainTest = df.randomSplit([0.5, 0.5])
trainingDF = trainTest[0]
testDF = trainTest[1]

# Define the decision tree regressor
dtr = DecisionTreeRegressor().setFeaturesCol("features").setLabelCol("PriceofUnitArea")

# Train the model
model = dtr.fit(trainingDF)
print(model)

# Predict on the test set
fullPredictions = model.transform(testDF).cache()

# Extract the predicted values
predictions = fullPredictions.select("prediction").rdd.map(lambda x: x[0])

# Extract the actual values (labels)
labels = fullPredictions.select("PriceofUnitArea").rdd.map(lambda x: x[0])

# Pair each prediction with its label
preds_label = predictions.zip(labels).collect()

for prediction in preds_label:
    print(prediction)

# data.show()

# Stop the session
spark.stop()

→ Save and run

→ python step01_regression.py

→ Each (prediction, label) pair is printed.
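The script prints raw (prediction, label) pairs, which are hard to judge by eye. As a rough sketch, the pairs could be summarized with a root-mean-square error. The `rmse` helper and the sample pairs below are made up for illustration and are not part of the original script; within Spark itself, `pyspark.ml.evaluation.RegressionEvaluator` serves this purpose.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (prediction, label) pairs."""
    squared = [(pred - label) ** 2 for pred, label in pairs]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical values standing in for the collected preds_label list
sample_pairs = [(30.0, 33.0), (42.0, 38.0), (25.0, 25.0)]
print(round(rmse(sample_pairs), 3))
```

A lower value means the predictions sit closer to the actual prices.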

pyspark_ml practice (2)

→ Create file : step02_logistic_regression.py

→ Write the code

# Create the Spark session
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression # worth remembering

spark = SparkSession.builder.appName("AppName").getOrCreate()

# Load the data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
print("hello")

# Build the model
# The syntax is similar to Scikit-Learn
mlr = LogisticRegression() # worth remembering
mlr_model = mlr.fit(training) # worth remembering

# Logistic regression is a linear model: print its coefficients (slopes) and intercept
print("Coefficients: " + str(mlr_model.coefficients))
print("Intercept: " + str(mlr_model.intercept))

spark.stop()

→ Save and run

→ python step02_logistic_regression.py
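The printed coefficients and intercept define the model's decision function. As a minimal sketch, a binary logistic regression turns a feature vector into a probability by applying the sigmoid to `w · x + b`. The weights and inputs below are hypothetical, not values fitted on `sample_libsvm_data.txt`.

```python
import math

def predict_probability(coefficients, intercept, features):
    """P(label = 1) for a binary logistic regression model."""
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical two-feature model
coefficients = [0.5, -0.25]
intercept = 0.0
print(predict_probability(coefficients, intercept, [2.0, 4.0]))  # margin is 0, so 0.5
```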

pyspark_ml practice (3)

pyspark_pipeline

→ Create file : step03_pipeline.py

→ Write the code

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

from pyspark.sql import SparkSession

# Create the Spark session
spark = SparkSession.builder.appName("MLPipeline").getOrCreate()

# Build a small toy training set
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Feature Engineering
# (the comments below use a cooking analogy)

# Prep step 1 : split the text into words
tokenizer = Tokenizer(inputCol='text', outputCol='words')

# Prep step 2 : turn the tokenized words into numeric features
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

# Prep step 3 : bring in the model
lr = LogisticRegression(maxIter=5, regParam=0.01)

# Start cooking : chain the stages into a pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Add the main ingredient : fit the pipeline on the training data
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Predict
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    row_id, text, prob, pred = row # a Row unpacks like a tuple
    print(
        # old-style string formatting
        "(%d, %s) -------> probability=%s, prediction=%f" % (row_id, text, str(prob), pred)
    )

# training.show()

# Stop the session
spark.stop()

→ Save and run

→ python step03_pipeline.py
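`HashingTF` maps each word to a column index by hashing it, so no vocabulary has to be built first. The sketch below illustrates the idea in plain Python. It uses MD5 from the standard library purely to get a deterministic hash, whereas Spark uses MurmurHash3, so the indices will not match Spark's; `num_features=16` is also an arbitrary choice for readability.

```python
import hashlib
from collections import Counter

def hashing_tf(words, num_features=16):
    """Count words per hashed bucket (a toy version of the hashing trick)."""
    counts = Counter()
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_features
        counts[index] += 1
    return dict(counts)

vector = hashing_tf("spark hadoop spark".split())
print(vector)  # sparse representation: {bucket index: term count}
```

Two different words can land in the same bucket; a larger number of features makes such collisions less likely.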

pyspark_ml practice (4)

→ Create file : step04_randomforest.py

→ Write the code

from pyspark.sql import SparkSession

# Machine learning libraries
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Create the session and load the data
spark = SparkSession.builder.appName("RandomForest").getOrCreate()

data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
print(type(data))

# Feature Engineering
# Index the label column
labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel').fit(data)

# Detect categorical features and index them
featureIndexer = VectorIndexer(inputCol='features',
                               outputCol='IndexedFeatures', maxCategories=4).fit(data)

# Split the data
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Model
rf = RandomForestClassifier(labelCol='indexedLabel', # dependent variable
                            featuresCol='IndexedFeatures', # independent variables
                            numTrees=10)

# Convert the indexed predictions back to the original labels
labelConverter = IndexToString(inputCol='prediction',
                               outputCol='predictedLabel', labels=labelIndexer.labels)

# Build the pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train the model
model = pipeline.fit(trainingData)

# Predict
predictions = model.transform(testData)

# Select the columns to display
predictions.select("predictedLabel", 'label', 'features').show(5)

# Evaluate the model
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy"
)

accuracy = evaluator.evaluate(predictions)
print("Test Error = %f " % (1.0 - accuracy))

spark.stop()

→ Save and run

→ python step04_randomforest.py
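The last step of the script reports `1.0 - accuracy` as the test error. A minimal sketch of that calculation, using made-up predictions and labels rather than real model output:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical indexed predictions and labels
predictions = [1.0, 0.0, 1.0, 1.0]
labels = [1.0, 0.0, 0.0, 1.0]
test_error = 1.0 - accuracy(predictions, labels)
print("Test Error = %f" % test_error)
```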

Tip

Move to the directory where venv was created

→ pip install jupyterlab

→ jupyter lab

→ Material for the blog can be written in JupyterLab.

Reference : https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.StructType.html

Posted 2022-04-28 | Updated 2022-05-01 | minkuen | python / library / pyspark | 5 minutes read (About 768 words)

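pyspark practice 03

Preparation

Open VS Code from Git Bash and open a terminal.

Right-click the desktop : git bash here

→ cd pyspk_project

→ code .

→ Git Bash terminal

pyspark practice (1)

Activate the virtual environment, then create the folder and file.

→ source venv/Scripts/activate

→ Create folder : chapter02_get_cleansing

Download data.zip from Slack.
Unzip it and copy the contents into the chapter02_get_cleansing folder.

→ Create file : pipeline.py

Start with a minimal script.

→ Write the code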

from pyspark.sql import SparkSession

from pyspark.sql.functions import *

print("Hello!")

→ Save

→ Move to the folder : cd chapter02_get_cleansing

→ Run : python pipeline.py

Continue writing the code

→ Change pipeline.py as follows

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# print("Hello!!")

# Create the Spark session
spark = SparkSession.builder.master("local[1]").\
    appName("quickpipeline").getOrCreate()

# Load the data (forward slashes also work on Windows and avoid escape issues)
df = spark.read.csv("data/AA_DFW_2015_Departures_Short.csv.gz"
                    , header = True)

print("file loaded")

df.show()

# Remove rows whose duration (the fourth column) is 0
df = df.filter(df[3] > 0)

# Add an ID column
df = df.withColumn('id', F.monotonically_increasing_id())
df.show()

df.write.csv("data/output.csv", mode = "overwrite")

spark.stop()

→ Save and run

→ python pipeline.py

→ It succeeded if data/output.csv is created (Spark writes it as a folder of part files).

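pyspark practice (2)

Write code that finds the minimum temperature per weather station.
Download the following from Slack:
- 1800.csv, book.txt, customer-orders.csv, fakefriends.csv
Copy them into the chapter02_get_cleansing/data folder.

Create file : min_temp.py

→ Write the code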

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('local').setAppName('MinTemperatures')

sc = SparkContext(conf = conf)

print("Hello")

→ Save and run

→ python min_temp.py

Continue writing the code

→ Change min_temp.py as follows

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('local').setAppName('MinTemperatures')
sc = SparkContext(conf = conf)

print("Begins...")

def parseLine(line):
    fields = line.split(',') # split the line on commas
    stationID = fields[0]
    entryType = fields[2]
    # the formula treats the raw value as tenths of a degree Celsius and converts it to Fahrenheit
    temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return (stationID, entryType, temperature)

lines = sc.textFile('data/1800.csv')
# print(lines)

parsedLines = lines.map(parseLine)
# print(parsedLines)

# keep only the minimum-temperature (TMIN) entries
minTemps = parsedLines.filter(lambda x: "TMIN" in x[1])

# reshape to (stationID, temperature) pairs
stationTemps = minTemps.map(lambda x: (x[0], x[2]))

# keep the lowest temperature per station
minTemps = stationTemps.reduceByKey(lambda x, y: min(x, y))

results = minTemps.collect()
print(results)

→ Save and run

→ python min_temp.py
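To make the RDD steps easier to follow, here is a plain-Python sketch of the intended logic: parse each line, keep only `TMIN` rows, and take the minimum per station. The sample lines are invented and only mimic the column layout the script assumes for 1800.csv (station ID in column 0, entry type in column 2, raw temperature in column 3).

```python
def parse_line(line):
    fields = line.split(",")
    station_id = fields[0]
    entry_type = fields[2]
    # same conversion as the script: raw value * 0.1, then Celsius -> Fahrenheit
    temperature = float(fields[3]) * 0.1 * (9.0 / 5.0) + 32.0
    return (station_id, entry_type, temperature)

# Hypothetical rows in the assumed layout
lines = [
    "ST01,18000101,TMAX,-75",
    "ST01,18000101,TMIN,-100",
    "ST01,18000102,TMIN,0",
    "ST02,18000101,TMIN,50",
]

min_temps = {}
for station, entry_type, temp in map(parse_line, lines):
    if entry_type == "TMIN":
        # keep the smaller value per station (a per-key reduction)
        min_temps[station] = min(temp, min_temps.get(station, temp))

print(min_temps)
```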

pyspark practice (3)

Write code that prints the average number of friends for each age.

Create file : friends-by-age.py

→ Write the code

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("FriendsByAge")
sc = SparkContext(conf = conf)

def parseLine(line):
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

lines = sc.textFile("data/fakefriends.csv")
rdd = lines.map(parseLine)
totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])
results = averagesByAge.collect()
for result in results:
    print(result)

→ Save and run

→ python friends-by-age.py
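The `mapValues` / `reduceByKey` chain accumulates a (sum, count) pair per age and then divides the two. A plain-Python sketch of the same aggregation, using invented (age, numFriends) records instead of fakefriends.csv:

```python
# Hypothetical (age, numFriends) records
records = [(33, 385), (26, 2), (33, 15), (26, 100)]

totals_by_age = {}
for age, num_friends in records:
    total, count = totals_by_age.get(age, (0, 0))
    # accumulate (sum of friends, number of people) per age
    totals_by_age[age] = (total + num_friends, count + 1)

averages_by_age = {age: total / count for age, (total, count) in totals_by_age.items()}
print(averages_by_age)
```

Carrying the count alongside the sum is what lets the average be computed in a single pass over the data.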

pyspark practice (4)

Create file : totalspent.py

→ Write the code

# Import the libraries
from pyspark import SparkConf, SparkContext

# User-defined function: extract (customer ID, amount spent) from a line
def extractCusPrice(line):
    fields = line.split(",")
    return (int(fields[0]), float(fields[2]))

# main function
def main():

    # Spark configuration
    conf = SparkConf().setMaster("local").setAppName('SpentbyCustomer')
    sc = SparkContext(conf = conf)

    # Load the data
    lines = sc.textFile("data/customer-orders.csv")
    # print("is data?")
    mappedInput = lines.map(extractCusPrice)
    totalByCustomer = mappedInput.reduceByKey(lambda x, y : x + y)

    # Sort by total: flip to (total, customerID) so the total becomes the key
    flipped = totalByCustomer.map(lambda x: (x[1], x[0]))
    totalByCustomerSorted = flipped.sortByKey()

    results = totalByCustomerSorted.collect()
    for result in results:
        print(result)

# Entry point
if __name__ == "__main__":
    main()

→ Save and run

→ python totalspent.py

Reference : https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.StructType.html

Posted 2022-04-27 | Updated 2022-04-30 | minkuen | python / library / pyspark | 4 minutes read (About 632 words)

pyspark practice 02

Preparation

Open VS Code from Git Bash and open a terminal.

Right-click the desktop : git bash here

→ cd pyspk_project

→ code .

→ Git Bash terminal

pyspark practice (1)

Activate the virtual environment and create the file.

→ source venv/Scripts/activate

→ Create the file in the chapter01_get_started folder

→ Create file : step04_structype.py

Keyword : Struct Type

Search terms : Spark Struct Type, Spark Struct

Reference link :

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.StructType.html

→ Write the code

from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, IntegerType, LongType

# Create the Spark session (required)
# spark = SparkSession.builder.appName("")
spark = SparkSession.builder.appName("PopularMovies").getOrCreate()

# Define the schema (for the u.logs data)
schema = StructType(
    [
        StructField("userID", IntegerType(), True),
        StructField("movieID", IntegerType(), True),
        StructField("rating", IntegerType(), True),
        StructField("timestamp", LongType(), True)
    ]
)

print("Schema is done")

# Load the data
movies_df = spark.read.option("sep", "\t").schema(schema).csv("ml-100k/u.logs")

# Sort movies by popularity in descending order:
# group by movieID, count(), then orderBy
topMovieIds = movies_df.groupBy("movieID").count().orderBy(func.desc('count'))

topMovieIds.show(10)

# Stop the session
spark.stop()

→ Move to the folder : cd chapter01_get_started

→ Save and run

→ python step04_structype.py

→ A table of ten rows should be printed.
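A schema tells Spark the name and type of every column so that nothing has to be guessed. The sketch below mimics that idea in plain Python: each tab-separated field is cast with the converter listed for its column, and the rows are then counted per `movieID`. The sample lines are invented, and `SCHEMA` and `parse_row` are illustrative names, not Spark API.

```python
from collections import Counter

# (column name, converter) pairs, mirroring the StructField list
SCHEMA = [("userID", int), ("movieID", int), ("rating", int), ("timestamp", int)]

def parse_row(line):
    """Cast each tab-separated field with the converter for its column."""
    values = line.rstrip("\n").split("\t")
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, values)}

# Hypothetical tab-separated rating lines
lines = ["1\t50\t5\t881250949", "2\t50\t4\t891717742", "3\t10\t3\t878887116"]
rows = [parse_row(line) for line in lines]

# group by movieID, count, and sort by count in descending order, in miniature
top_movies = Counter(row["movieID"] for row in rows).most_common()
print(top_movies)
```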

pyspark practice (2)

→ Create the file in the chapter01_get_started folder

→ Create file : step05_advancestructype.py

→ Write the code

from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, IntegerType, LongType
import codecs

print("Hello")

def loadMovieNames():
    movieNames = {}
    with codecs.open("ml-100k/u.ITEM", "r", encoding="ISO-8859-1", errors="ignore") as f:
        for line in f:
            fields = line.split("|")
            movieNames[int(fields[0])] = fields[1]
    return movieNames

# Create the Spark session
spark = SparkSession.builder.appName("PopularMovies").getOrCreate()

# Broadcast the Python dictionary so that every executor can read it
nameDict = spark.sparkContext.broadcast(loadMovieNames())

# Define the schema (for the u.logs data)
schema = StructType(
    [
        StructField("userID", IntegerType(), True)
        , StructField("movieID", IntegerType(), True)
        , StructField("rating", IntegerType(), True)
        , StructField("timestamp", LongType(), True)
    ]
)

print("Schema is done")

# Load the data
movies_df = spark.read.option("sep", "\t").schema(schema).csv("ml-100k/u.logs")

# Count the ratings per movieID (sorted in descending order further below)
topMovieIds = movies_df.groupBy("movieID").count()

# Dictionary lookup (key-value):
# given the key (movieID), return the value (movie title)
def lookupName(movieID):
    return nameDict.value[movieID]

lookupNameUDF = func.udf(lookupName)

# Add a movieTitle column to the existing topMovieIds data
moviesWithNames = topMovieIds.withColumn("movieTitle", lookupNameUDF(func.col("movieID")))

final_df = moviesWithNames.orderBy(func.desc("count"))

final_df.show(10)

# Stop the session
spark.stop()

→ Save and run

→ python step05_advancestructype.py

→ The ten most-rated movies are printed together with their titles.
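The lookup at the heart of this script is an ordinary dictionary access; broadcasting only makes the dictionary available on every executor. A plain-Python sketch of the parsing and lookup, with invented pipe-separated lines in place of the real item file:

```python
def load_movie_names(lines):
    """Build {movieID: title} from 'id|title|...' lines."""
    movie_names = {}
    for line in lines:
        fields = line.split("|")
        movie_names[int(fields[0])] = fields[1]
    return movie_names

# Hypothetical item lines in the assumed 'id|title|...' layout
sample_lines = ["1|First Movie (1995)|01-Jan-1995", "2|Second Movie (1996)|01-Jan-1996"]
names = load_movie_names(sample_lines)

def lookup_name(movie_id):
    # .get avoids a KeyError for IDs missing from the dictionary
    return names.get(movie_id, "unknown")

print(lookup_name(2), "/", lookup_name(99))
```

Using `.get` with a default is a design choice of this sketch; the script's `nameDict.value[movieID]` would raise an error for an unknown ID instead.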

Reference : https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.StructType.html

Posted 2022-04-26 | Updated 2022-04-29 | minkuen | python / library / pyspark | 5 minutes read (About 685 words)

pyspark practice 01

Preparation

Set up the environment by following the "spark on windows" guide.
Install Spark.
If you are new to Python, install **Anaconda**.

Installing pyspark

Use Git Bash to create a folder and open a terminal.

Right-click the desktop : git bash here

→ mkdir pyspk_project

→ cd pyspk_project

→ code .

→ Git Bash terminal

Create a virtual environment and install pyspark

→ virtualenv venv

→ source venv/Scripts/activate

→ pip install pyspark

pyspark practice_1

Create the folder and file

→ Create folder : chapter01_get_started

→ Create file : step01_basic.py

→ Write the code

import pyspark

print(pyspark.__version__)

→ Save

→ Move to the folder : cd chapter01_get_started

→ Run : python step01_basic.py

Continue writing the code

→ Change step01_basic.py as follows

# -*- coding: utf-8 -*-

import pyspark
print(pyspark.__version__)

from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession.builder.master('local[1]').appName('SampleTutorial').getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

print("rdd Count:", rdd.count())

→ Save and run

→ Enter in the browser address bar : http://localhost:4040/

→ The Spark web UI is displayed (see textbook p. 278).

pyspark practice_2

Download dataset.zip from Slack.
Unzip it and copy the contents into the chapter01_get_started folder.

Work in VS Code

→ Create file : step02_ratings.py

→ Write the code

# SparkContext
# RDD

from pyspark import SparkConf, SparkContext
import collections

print("Hello")

def main():
    # MasterNode = local
    # MapReduce

    conf = SparkConf().setMaster('local').setAppName('RatingsHistogram')
    sc = SparkContext(conf = conf)

    lines = sc.textFile("ml-100k/u.logs")
    ratings = lines.map(lambda x: x.split()[2])
    print("ratings: ", ratings)

    result = ratings.countByValue()
    print("result:", result)

    sortedResults = collections.OrderedDict(sorted(result.items()))
    for key, value in sortedResults.items():
        print("%s %i" % (key, value))

if __name__ == "__main__":
    main()

→ Save

→ Run : python step02_ratings.py

→ The count of each rating value is printed, sorted by rating.
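`countByValue()` returns a dictionary of value → count to the driver. As a rough local equivalent, `collections.Counter` does the same for a small list; the rating strings below are invented for illustration.

```python
import collections

# Hypothetical rating values, kept as strings like those produced by split()
ratings = ["3", "5", "4", "3", "5", "3"]

result = collections.Counter(ratings)
sorted_results = collections.OrderedDict(sorted(result.items()))
for key, value in sorted_results.items():
    print("%s %i" % (key, value))
```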

Work in VS Code

→ Create file : step03_dataloading.py

→ Write the code

→ pip install pandas

# Using Spark SQL

# Spark Session
from pyspark.sql import SparkSession
import pandas as pd

# Create a Spark session (kept inside a string so that it does not run)
"""
my_spark = SparkSession.builder.getOrCreate()
print(my_spark)

# List the databases in the catalog
print(my_spark.catalog.listDatabases())

# show database
my_spark.sql('show databases').show()

# Check the current database
my_spark.catalog.currentDatabase()
my_spark.stop()
"""

# Load the CSV file
spark = SparkSession.builder.master('local[1]').appName('DBTutorial').getOrCreate()
flights = spark.read.option('header', 'true').csv('data/flight_small.csv')
# flights.show(4)

# spark.catalog.currentDatabase()
# Register flights as a temporary view so that it can be queried with SQL
flights.createOrReplaceTempView('flights')

# print(spark.catalog.listTables('default'))
# spark.sql('show tables from default').show()

# Fetch data with a query
query = "FROM flights SELECT * LIMIT 10"
query2 = "SELECT * FROM flights LIMIT 10"

# Run the query in the Spark session
flights10 = spark.sql(query2)
flights10.show()

# Convert the Spark DataFrame to a pandas DataFrame
pd_flights10 = flights10.toPandas()
print(pd_flights10.head())

→ Save

→ Run : python step03_dataloading.py
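The key part of the query is the `LIMIT` clause. The sketch below demonstrates the same `SELECT * ... LIMIT n` shape using the standard-library `sqlite3` module as a stand-in for Spark SQL. The table columns and rows are invented, and SQLite's dialect is not identical to Spark's, so treat this only as an illustration of the query shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, air_time INTEGER)")
# Hypothetical rows standing in for the flights data
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("SEA", "LAX", 132), ("SEA", "HNL", 360), ("PDX", "SFO", 83)],
)

# LIMIT caps the number of rows returned, whatever the table size
rows = conn.execute("SELECT * FROM flights LIMIT 2").fetchall()
print(len(rows))
conn.close()
```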

Reference
- 실무 예제로 배우는 데이터
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.StructType.html

Posted 2022-03-26 | Updated 2022-03-26 | minkuen | python / library / visualization | 12 minutes read (About 1765 words)

visualization_tutorial_01

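Data visualization

Basic requirements for data visualization

Choose a chart that fits the purpose
- Line chart, bar chart, scatter plot, box plot, etc.
Choose a tool that fits the environment
- Code-based (R, Python)
- Program-based (visualization tools)
  - Power BI, Tableau, Excel
Use colors and shapes that fit the context (domain)
- Company logo colors
- Common color conventions
  - Red (warning), green (safe)
Advantages of the code-based approach
- Reproducibility (charts can be wrapped in functions)
  - Several charts can be produced at once
  - Existing code can be copied and pasted
  - No limit on data size (as long as it fits in RAM)
Points to note when using Matplotlib
- Use the object-oriented API
  - Once mastered, it carries over to other libraries.
- Avoid the pyplot API.
  - Even once mastered, it differs from other syntax, so the skill does not carry over.

References

데이터 분석 강의안_220307.pdf (data analysis lecture notes), page 63, "Visualization"
Matplotlib : https://matplotlib.org/stable/api/ticker_api.html
seaborn : https://seaborn.pydata.org/
A top Korean blog on Python visualization: Pega Devlog (jehyunlee.github.io)
- This blog is worth reading closely, so bookmark it.

Importing the libraries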

import matplotlib
import seaborn as sns
print(matplotlib.__version__)
print(sns.__version__)

3.2.2
0.11.2

Drawing a first plot

import matplotlib.pyplot as plt

dates = [
    '2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
    '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09', '2021-01-10'
]
min_temperature = [20.7, 17.9, 18.8, 14.6, 15.8, 15.8, 15.8, 17.4, 21.8, 20.0]
max_temperature = [34.7, 28.9, 31.8, 25.6, 28.8, 21.8, 22.8, 28.4, 30.8, 32.0]

# From now on, start every plot this way; treat this line as the default.
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize=(10,6))

ax.plot(dates, min_temperature, label = "Min Temp.")
ax.plot(dates, max_temperature, label = "Max Temp.")
ax.legend()
plt.show()

Downloading stock data

!pip install yfinance --upgrade --no-cache-dir

Collecting yfinance
  Downloading yfinance-0.1.70-py2.py3-none-any.whl (26 kB)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.7/dist-packages (from yfinance) (1.3.5)
Requirement already satisfied: numpy>=1.15 in /usr/local/lib/python3.7/dist-packages (from yfinance) (1.21.5)
Requirement already satisfied: multitasking>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from yfinance) (0.0.10)
Collecting lxml>=4.5.1
  Downloading lxml-4.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.4 MB)
Collecting requests>=2.26
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->yfinance) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->yfinance) (2018.9)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas>=0.24.0->yfinance) (1.15.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.26->yfinance) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.26->yfinance) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.26->yfinance) (1.24.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests>=2.26->yfinance) (2.0.12)
Installing collected packages: requests, lxml, yfinance
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: lxml
    Found existing installation: lxml 4.2.6
    Uninstalling lxml-4.2.6:
      Successfully uninstalled lxml-4.2.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed lxml-4.8.0 requests-2.27.1 yfinance-0.1.70

import yfinance as yf
data = yf.download("AAPL", start="2019-08-01", end="2020-08-01")
ts = data['Open']
print(ts.head())
print(type(ts))   # this is a Series

[*********************100%***********************]  1 of 1 completed
Date
2019-08-01    53.474998
2019-08-02    51.382500
2019-08-05    49.497501
2019-08-06    49.077499
2019-08-07    48.852501
Name: Open, dtype: float64
<class 'pandas.core.series.Series'>

pyplot style

import matplotlib.pyplot as plt
plt.plot(ts)
plt.title("Stock Market of AAPL")
plt.xlabel("Date")
plt.ylabel("Open Price")
plt.show()

Drawing with the object-oriented API

fig is the outer frame (the Figure);
everything else is drawn through ax (the Axes).

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(ts)
ax.set_title("Stock Market of AAPL")
ax.set_xlabel("Date")
ax.set_ylabel("Open Price")
plt.show()

Bar chart

import matplotlib.pyplot as plt
import numpy as np
import calendar

month_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
sold_list = [300, 400, 550, 900, 600, 960, 900, 910, 800, 700, 550, 450]

fig, ax = plt.subplots(figsize = (10, 6))
barplots = ax.bar(month_list, sold_list) # bar() creates the bar chart

print("barplots : ", barplots)

for plot in barplots:
  print(plot)
  #print(plot.get_height())
  #print(plot.get_x())
  #print(plot.get_y())
  #print(plot.get_width())
  height = plot.get_height()
  ax.text(plot.get_x() + plot.get_width()/2., height, height, ha = 'center', va = 'bottom')  

plt.xticks(month_list, calendar.month_name[1:13], rotation = 90)
plt.show()

barplots :  <BarContainer object of 12 artists>
Rectangle(xy=(0.6, 0), width=0.8, height=300, angle=0)
Rectangle(xy=(1.6, 0), width=0.8, height=400, angle=0)
Rectangle(xy=(2.6, 0), width=0.8, height=550, angle=0)
Rectangle(xy=(3.6, 0), width=0.8, height=900, angle=0)
Rectangle(xy=(4.6, 0), width=0.8, height=600, angle=0)
Rectangle(xy=(5.6, 0), width=0.8, height=960, angle=0)
Rectangle(xy=(6.6, 0), width=0.8, height=900, angle=0)
Rectangle(xy=(7.6, 0), width=0.8, height=910, angle=0)
Rectangle(xy=(8.6, 0), width=0.8, height=800, angle=0)
Rectangle(xy=(9.6, 0), width=0.8, height=700, angle=0)
Rectangle(xy=(10.6, 0), width=0.8, height=550, angle=0)
Rectangle(xy=(11.6, 0), width=0.8, height=450, angle=0)

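Each label is placed at the horizontal centre of its bar, `get_x() + get_width() / 2`, and at the bar's height. The arithmetic can be checked without drawing anything, using the left edge, width, and height of the first three rectangles in the printed output:

```python
# (x, width, height) taken from the first three Rectangle lines above
bars = [(0.6, 0.8, 300), (1.6, 0.8, 400), (2.6, 0.8, 550)]

for x, width, height in bars:
    label_x = round(x + width / 2.0, 6)  # horizontal centre of the bar
    label_y = height                     # va='bottom' rests the text on top of this point
    print(label_x, label_y)
```

The centres come out at the month numbers themselves, which is why the labels line up with the x ticks.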

Scatter plot

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips") # restaurant bill (receipt) data
# print(tips.info())
x = tips['total_bill']
y = tips['tip']

# Scatter plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x,y)
ax.set_xlabel('Total Bill')
ax.set_ylabel('Tip')
plt.show()

label, data = tips.groupby('sex')
# print(label)
# print(data)

tips['sex_color'] = tips['sex'].map({'Male': '#2521F6', 'Female': '#EB4036'})
#print(tips.head())

fig, ax = plt.subplots(figsize=(10,6))
for label, data in tips.groupby('sex'):
  ax.scatter(data['total_bill'], data['tip'], label=label, color=data['sex_color'],alpha=0.5)
  ax.set_xlabel('Total Bill')
  ax.set_ylabel('Tip')

ax.legend() # legend
plt.show()

Seaborn

The following code produces the same result as above, but it is simpler.

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")
# print(tips.info())

fig, ax =plt.subplots(figsize=(10,6))
sns.scatterplot(x='total_bill', y = 'tip', hue='sex', data = tips)
plt.show()

# Draw two plots side by side
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,5))

sns.regplot(x = "total_bill", y = "tip", data = tips, ax=ax[1], fit_reg = True)
ax[1].set_title("with linear regression line")

sns.regplot(x = "total_bill", y = "tip", data = tips, ax=ax[0], fit_reg = False)
ax[0].set_title("without linear regression line")

plt.show()

Bar chart, the seaborn way

sns.countplot(x="day", data=tips)
plt.show()

print(tips['day'].value_counts().index)
print(tips['day'].value_counts().values)
print(tips['day'].value_counts(ascending=True))

CategoricalIndex(['Sat', 'Sun', 'Thur', 'Fri'], categories=['Thur', 'Fri', 'Sat', 'Sun'], ordered=False, dtype='category')
[87 76 62 19]
Fri     19
Thur    62
Sun     76
Sat     87
Name: day, dtype: int64

fig, ax = plt.subplots()
ax = sns.countplot(x="day", data=tips, order = tips['day'].value_counts().index)

for plot in ax.patches: # the patches play the same role as the bars in the matplotlib example
  print(plot)
  height = plot.get_height()
  ax.text(plot.get_x() + plot.get_width()/2., height, height, ha = 'center', va = 'bottom')  

ax.set_ylim(-5, 100)
plt.show()

Rectangle(xy=(-0.4, 0), width=0.8, height=87, angle=0)
Rectangle(xy=(0.6, 0), width=0.8, height=76, angle=0)
Rectangle(xy=(1.6, 0), width=0.8, height=62, angle=0)
Rectangle(xy=(2.6, 0), width=0.8, height=19, angle=0)

A more advanced chart

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator, FuncFormatter)

def major_formatter(x, pos):
  return "%.2f$" % x

formatter = FuncFormatter(major_formatter)

tips = sns.load_dataset("tips")
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))


ax0 = sns.barplot(x="day", y="total_bill", data=tips,
            ci=None, color='lightgray', alpha=0.85, zorder=2,  # alpha sets the transparency
            ax = ax[0])
# groupby
group_mean = tips.groupby(['day'])['total_bill'].agg('mean')
#print(group_mean)

h_day = group_mean.sort_values(ascending=False).index[0]  # the day with the highest mean (Sun)
#print(h_day)
h_mean = group_mean.sort_values(ascending=False).values[0]
print(h_mean)

# Add text labels
for plot in ax0.patches:
  height = np.round(plot.get_height(), 2)
  # print(height)

  # Default
  fontweight = "normal"
  color = "k"
  if h_mean == height:
    fontweight = "bold"
    color = "darkred"
    plot.set_facecolor(color)
    plot.set_edgecolor("black")

  ax0.text(plot.get_x() + plot.get_width()/2.,
           height + 1, height,
           ha = 'center', size=12, fontweight = fontweight, color = color)

# Adjust the axis
ax0.set_ylim(-3, 30)
ax0.set_title("Bar Graph", size = 16)

# Remove the borders (spines)
ax0.spines['top'].set_visible(False)
ax0.spines['left'].set_position(("outward", 20))
ax0.spines['left'].set_visible(False)
ax0.spines['right'].set_visible(False)

ax0.yaxis.set_major_locator(MultipleLocator(10))
ax0.yaxis.set_major_formatter(formatter)
ax0.yaxis.set_minor_locator(MultipleLocator(5))

ax0.set_ylabel("Avg. Total Bill($)", fontsize=14)

ax0.grid(axis="y", which="major", color = "lightgray")
ax0.grid(axis="y", which="major", ls = ":")

for xtick in ax0.get_xticklabels():
  print(xtick)
  if xtick.get_text() == h_day:
    xtick.set_color("darkred")
    xtick.set_fontweight("demibold")

ax0.set_xticklabels(['Thursday', 'Friday', 'Saturday', 'Sunday'], size = 12)

plt.show()

21.41
Text(0, 0, 'Thur')
Text(0, 0, 'Fri')
Text(0, 0, 'Sat')
Text(0, 0, 'Sun')

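`FuncFormatter` wraps a function that receives the tick value and its position and returns the label text. The function itself is plain Python, so its output can be checked on its own. The tick values below are an assumption, chosen to match a step of 10 on the 0 to 30 range.

```python
def major_formatter(x, pos):
    """Format a tick value with two decimals and a trailing dollar sign."""
    return "%.2f$" % x

# Assumed tick values, spaced 10 apart
for pos, tick in enumerate([0, 10, 20, 30]):
    print(major_formatter(tick, pos))
```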

Posted 2022-03-24 | Updated 2022-03-24 | minkuen | python / library / pandas | 25 minutes read (About 3797 words)

pandas_tutorial_02

Importing the library

import pandas as pd
print(pd.__version__)

1.3.5

Connecting Google Drive

Google Drive → colab notebook → create a new folder : data → upload the lemonade.csv file downloaded from Slack → run the following code

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

"Mounted at /content/drive" was printed, so the mount succeeded.

Folder icon on the left → drive → MyDrive → Colab Notebooks → data → find supermarket_sales.csv and right-click it → copy path → paste it into the following code

DATA_PATH = '/content/drive/MyDrive/Colab Notebooks/data/supermarket_sales.csv'
sales = pd.read_csv(DATA_PATH)
sales

	Invoice ID	Branch	City	Customer type	Gender	Product line	Unit price	Quantity	Date	Time	Payment
0	750-67-8428	A	Yangon	Member	Female	Health and beauty	74.69	7	1/5/2019	13:08	Ewallet
1	226-31-3081	C	Naypyitaw	Normal	Female	Electronic accessories	15.28	5	3/8/2019	10:29	Cash
2	631-41-3108	A	Yangon	Normal	Male	Home and lifestyle	46.33	7	3/3/2019	13:23	Credit card
3	123-19-1176	A	Yangon	Member	Male	Health and beauty	58.22	8	1/27/2019	20:33	Ewallet
4	373-73-7910	A	Yangon	Normal	Male	Sports and travel	86.31	7	2/8/2019	10:37	Ewallet
...	...	...	...	...	...	...	...	...	...	...	...
995	233-67-5758	C	Naypyitaw	Normal	Male	Health and beauty	40.35	1	1/29/2019	13:46	Ewallet
996	303-96-2227	B	Mandalay	Normal	Female	Home and lifestyle	97.38	10	3/2/2019	17:16	Ewallet
997	727-02-1313	A	Yangon	Member	Male	Food and beverages	31.84	1	2/9/2019	13:22	Cash
998	347-56-2442	A	Yangon	Normal	Male	Home and lifestyle	65.82	1	2/22/2019	15:33	Cash
999	849-09-3807	A	Yangon	Member	Female	Fashion accessories	88.34	7	2/18/2019	13:28	Cash

1000 rows × 11 columns


sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Invoice ID     1000 non-null   object 
 1   Branch         1000 non-null   object 
 2   City           1000 non-null   object 
 3   Customer type  1000 non-null   object 
 4   Gender         1000 non-null   object 
 5   Product line   1000 non-null   object 
 6   Unit price     1000 non-null   float64
 7   Quantity       1000 non-null   int64  
 8   Date           1000 non-null   object 
 9   Time           1000 non-null   object 
 10  Payment        1000 non-null   object 
dtypes: float64(1), int64(1), object(9)
memory usage: 86.1+ KB

Group by

Here we learn aggregate functions (group by is used as a synonym).

# Try various things to get a feel for the data
sales['Invoice ID'].value_counts()

750-67-8428    1
642-61-4706    1
816-72-8853    1
491-38-3499    1
322-02-2271    1
              ..
633-09-3463    1
374-17-3652    1
378-07-7001    1
433-75-6987    1
849-09-3807    1
Name: Invoice ID, Length: 1000, dtype: int64

# Try various things to get a feel for the data
sales.groupby('Customer type')['Quantity'].sum()

Customer type
Member    2785
Normal    2725
Name: Quantity, dtype: int64
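`groupby(...)[...].sum()` splits the rows by key and adds up one column within each group. A plain-Python sketch of that idea, using the (Customer type, Quantity) values of the first five rows shown in the table above:

```python
from collections import defaultdict

# (Customer type, Quantity) from rows 0-4 of the displayed table
rows = [("Member", 7), ("Normal", 5), ("Normal", 7), ("Member", 8), ("Normal", 7)]

quantity_by_type = defaultdict(int)
for customer_type, quantity in rows:
    quantity_by_type[customer_type] += quantity

print(dict(quantity_by_type))
```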

sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].sum()

Customer type  Branch  Payment    
Member         A       Cash           308
                       Credit card    282
                       Ewallet        374
               B       Cash           284
                       Credit card    371
                       Ewallet        269
               C       Cash           293
                       Credit card    349
                       Ewallet        255
Normal         A       Cash           264
                       Credit card    298
                       Ewallet        333
               B       Cash           344
                       Credit card    228
                       Ewallet        324
               C       Cash           403
                       Credit card    194
                       Ewallet        337
Name: Quantity, dtype: int64

The data type of the result is Series.

print(type(sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].sum()))

<class 'pandas.core.series.Series'>
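
Since the grouped sum comes back as a Series with a MultiIndex, it can be handy to flatten it back into ordinary columns. A minimal sketch, assuming a small made-up `toy` frame (not the real `sales` data):

```python
import pandas as pd

# Hypothetical toy data standing in for `sales`
toy = pd.DataFrame({
    "Customer type": ["Member", "Member", "Normal", "Normal"],
    "Branch": ["A", "B", "A", "A"],
    "Quantity": [7, 5, 3, 4],
})

# Selecting one column and aggregating gives a Series indexed by the group keys
grouped = toy.groupby(["Customer type", "Branch"])["Quantity"].sum()
print(type(grouped))

# reset_index() turns the index levels back into regular columns
flat = grouped.reset_index()
print(flat)
```

The flattened frame is usually easier to merge or plot than the MultiIndex Series.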

Choosing good search keywords is important.

sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].agg(['sum', 'mean'])

			sum	mean
Customer type	Branch	Payment
Member	A	Cash	308	5.500000
		Credit card	282	5.755102
		Ewallet	374	6.032258
	B	Cash	284	5.358491
		Credit card	371	5.888889
		Ewallet	269	5.489796
	C	Cash	293	4.966102
		Credit card	349	5.816667
		Ewallet	255	5.100000
Normal	A	Cash	264	4.888889
		Credit card	298	5.418182
		Ewallet	333	5.203125
	B	Cash	344	6.035088
		Credit card	228	4.956522
		Ewallet	324	5.062500
	C	Cash	403	6.200000
		Credit card	194	5.105263
		Ewallet	337	6.017857

print(type(sales.groupby(['Customer type', 'Branch', 'Payment'])['Quantity'].agg(['sum', 'mean'])))

<class 'pandas.core.frame.DataFrame'>
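
With a list of functions, agg() returns a DataFrame with one column per function. A small sketch on made-up data (the `toy` frame and the names `total` / `average` are assumptions for illustration) that also shows named aggregation:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Branch": ["A", "A", "B", "B"],
    "Quantity": [2, 4, 6, 10],
})

# A list of functions yields one output column per function
stats = toy.groupby("Branch")["Quantity"].agg(["sum", "mean"])
print(type(stats))
print(stats)

# Named aggregation lets you choose the output column names yourself
named = toy.groupby("Branch")["Quantity"].agg(total="sum", average="mean")
print(named)
```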

sales.groupby(['Customer type', 'Branch', 'Payment'], as_index=False)['Quantity'].agg(['sum', 'mean'])

			sum	mean
Customer type	Branch	Payment
Member	A	Cash	308	5.500000
		Credit card	282	5.755102
		Ewallet	374	6.032258
	B	Cash	284	5.358491
		Credit card	371	5.888889
		Ewallet	269	5.489796
	C	Cash	293	4.966102
		Credit card	349	5.816667
		Ewallet	255	5.100000
Normal	A	Cash	264	4.888889
		Credit card	298	5.418182
		Ewallet	333	5.203125
	B	Cash	344	6.035088
		Credit card	228	4.956522
		Ewallet	324	5.062500
	C	Cash	403	6.200000
		Credit card	194	5.105263
		Ewallet	337	6.017857

Handling missing values

Creating data with missing values
- Let's make up a few different examples (for practice)

import pandas as pd
import numpy as np

dict_01 = {
    'Score_A' : [80, 90, np.nan, 80],
    'Score_B' : [30, 45, np.nan, np.nan],
    'Score_C' : [np.nan, 50, 80, 90],
}

df = pd.DataFrame(dict_01)
df

	Score_A	Score_B	Score_C
0	80.0	30.0	NaN
1	90.0	45.0	50.0
2	NaN	NaN	80.0
3	80.0	NaN	90.0

True is treated as the number 1
False is treated as the number 0
Counting missing values (NaN)

df.isnull().sum()

Score_A    1
Score_B    2
Score_C    1
dtype: int64
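
The counting works because isnull() returns a True/False mask and sum() adds True as 1. A minimal sketch with a made-up `scores` frame (an assumption, modeled on the data above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps
scores = pd.DataFrame({
    "Score_A": [80, 90, np.nan, 80],
    "Score_B": [30, 45, np.nan, np.nan],
})

mask = scores.isnull()          # DataFrame of True/False
per_column = mask.sum()         # True counts as 1, so this is the NaN count per column
ratio = mask.mean()             # fraction of missing values per column
total = int(mask.sum().sum())   # total number of missing cells

print(per_column)
print(ratio)
print(total)
```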

Filling missing values with something else

df.fillna("0")

	Score_A	Score_B	Score_C
0	80.0	30.0	0
1	90.0	45.0	50.0
2	0	0	80.0
3	80.0	0	90.0
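
Note that fillna("0") inserts the string "0", not the number 0, which is why the filled cells print as 0 rather than 0.0. A small sketch of the difference, using a made-up Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([80.0, np.nan, 90.0])

as_text = s.fillna("0")   # the gap now holds the string "0"
as_number = s.fillna(0)   # the gap holds the number 0, the column stays numeric

print(type(as_text.iloc[1]))
print(as_number.dtype)
print(as_number.sum())
```

If the column is going to be used in arithmetic later, filling with a number is usually the safer choice.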

# Fill each gap with the value from the cell directly above (forward fill)
df.fillna(method="pad")

	Score_A	Score_B	Score_C
0	80.0	30.0	NaN
1	90.0	45.0	50.0
2	90.0	45.0	80.0
3	80.0	45.0	90.0
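
A forward fill cannot fill a gap in the first row, because there is nothing above it (see Score_C in row 0). A minimal sketch on a made-up Series; as far as I know, recent pandas versions prefer ffill() / bfill() over the `method=` argument:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0])

forward = s.ffill()    # copy the previous value downward; the leading NaN stays
backward = s.bfill()   # copy the next value upward; the leading NaN gets filled

print(forward)
print(backward)
```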

dict_01 = {
    "성별" : ["남자", "여자", np.nan, "남자"],
    "Salary" : [30, 45, 90, 70],
}

df = pd.DataFrame(dict_01)
df

	성별	Salary
0	남자	30
1	여자	45
2	NaN	90
3	남자	70

# '성별' is the gender column; "성별 없음" means "no gender"
df['성별'].fillna("성별 없음")

0       남자
1       여자
2    성별 없음
3       남자
Name: 성별, dtype: object

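Missing values

-> String columns and numeric columns call for different approaches
-> String columns: fill by frequency, i.e. put in the most frequent string (the mode)
-> Numeric columns: mean, max, min, median, and so on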
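
The two strategies above can be sketched as follows. The `people` frame and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one string column and one numeric column, each with a gap
people = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Salary": [30.0, 45.0, np.nan, 75.0],
})

# String column: fill with the most frequent value (the mode)
most_common = people["Gender"].mode()[0]
people["Gender"] = people["Gender"].fillna(most_common)

# Numeric column: fill with the mean (median() is a common alternative)
people["Salary"] = people["Salary"].fillna(people["Salary"].mean())

print(people)
```

mode() returns a Series because there can be ties, so `[0]` picks the first of the most frequent values.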

import pandas as pd
import numpy as np

dict_01 = {
    'Score_A' : [80, 90, np.nan, 80],
    'Score_B' : [30, 45, np.nan, np.nan],
    'Score_C' : [np.nan, 50, 80, 90],
    'Score_D' : [50, 30, 80, 60],
}

df = pd.DataFrame(dict_01)
df

	Score_A	Score_B	Score_C	Score_D
0	80.0	30.0	NaN	50
1	90.0	45.0	50.0	30
2	NaN	NaN	80.0	80
3	80.0	NaN	90.0	60

Drop every column that contains a missing value.
- axis = 1 -> columns

df.dropna(axis = 1)

	Score_D
0	50
1	30
2	80
3	60

Drop every row that contains a missing value.
- axis = 0 -> index

df.dropna(axis = 0)

	Score_A	Score_B	Score_C	Score_D
1	90.0	45.0	50.0	30
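
Dropping every row with any gap leaves only one row here, which is often too aggressive. A sketch of the `subset`, `thresh`, and `how` parameters of dropna, on a made-up `df_na` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame modeled on the scores above
df_na = pd.DataFrame({
    "Score_A": [80, 90, np.nan, 80],
    "Score_B": [30, 45, np.nan, np.nan],
    "Score_C": [np.nan, 50, 80, 90],
})

# Drop a row only if the listed column is missing
by_subset = df_na.dropna(subset=["Score_A"])

# Keep rows that have at least 2 non-missing values
by_thresh = df_na.dropna(thresh=2)

# Drop a row only if every value in it is missing
all_missing = df_na.dropna(how="all")

print(by_subset.index.tolist())
print(by_thresh.index.tolist())
print(all_missing.index.tolist())
```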

Outliers

sales

	Invoice ID	Branch	City	Customer type	Gender	Product line	Unit price	Quantity	Date	Time	Payment
0	750-67-8428	A	Yangon	Member	Female	Health and beauty	74.69	7	1/5/2019	13:08	Ewallet
1	226-31-3081	C	Naypyitaw	Normal	Female	Electronic accessories	15.28	5	3/8/2019	10:29	Cash
2	631-41-3108	A	Yangon	Normal	Male	Home and lifestyle	46.33	7	3/3/2019	13:23	Credit card
3	123-19-1176	A	Yangon	Member	Male	Health and beauty	58.22	8	1/27/2019	20:33	Ewallet
4	373-73-7910	A	Yangon	Normal	Male	Sports and travel	86.31	7	2/8/2019	10:37	Ewallet
...	...	...	...	...	...	...	...	...	...	...	...
995	233-67-5758	C	Naypyitaw	Normal	Male	Health and beauty	40.35	1	1/29/2019	13:46	Ewallet
996	303-96-2227	B	Mandalay	Normal	Female	Home and lifestyle	97.38	10	3/2/2019	17:16	Ewallet
997	727-02-1313	A	Yangon	Member	Male	Food and beverages	31.84	1	2/9/2019	13:22	Cash
998	347-56-2442	A	Yangon	Normal	Male	Home and lifestyle	65.82	1	2/22/2019	15:33	Cash
999	849-09-3807	A	Yangon	Member	Female	Fashion accessories	88.34	7	2/18/2019	13:28	Cash

1000 rows × 11 columns

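The usual statistical formula
IQR - box plot - quartiles
Q0 (0%), Q1 (25%), Q2 (50%), Q3 (75%), Q4 (100%)
Lower bound for outliers: Q1 - 1.5 * (Q3 - Q1)
Upper bound for outliers: Q3 + 1.5 * (Q3 - Q1)
Outlier criteria as seen from the domain (each business area, your future workplace), i.e. conventions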
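
The IQR rule above can be sketched as follows, on a small made-up `prices` Series (not the real `sales` data):

```python
import pandas as pd

# Hypothetical prices with one obviously extreme value
prices = pd.Series([10, 12, 13, 14, 15, 16, 18, 100], name="price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1                 # interquartile range

lower = q1 - 1.5 * iqr        # lower bound for outliers
upper = q3 + 1.5 * iqr        # upper bound for outliers

# A value is an outlier if it falls outside [lower, upper]
is_outlier = (prices < lower) | (prices > upper)
print(prices[is_outlier])
```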

sales[['Unit price']].describe()

	Unit price
count	1000.000000
mean	55.672130
std	26.494628
min	10.080000
25%	32.875000
50%	55.230000
75%	77.935000
max	99.960000

Lower bound for outliers: Q1 - 1.5 * (Q3 - Q1)
A formula like this is statistically sound, but there are other ways to decide whether a value is an outlier.

Q1 = sales['Unit price'].quantile(0.25)
Q3 = sales['Unit price'].quantile(0.75)

# Treat values below Q1 as outliers
outliers_q1 = (sales['Unit price'] < Q1)

# Treat values above Q3 as outliers
outliers_q3 = (sales['Unit price'] > Q3)

This code is especially important.

print(sales['Unit price'][~(outliers_q1 | outliers_q3)])

0      74.69
2      46.33
3      58.22
6      68.84
7      73.56
       ...  
991    76.60
992    58.03
994    60.95
995    40.35
998    65.82
Name: Unit price, Length: 500, dtype: float64
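
The mask keeps only the values between Q1 and Q3, which is why about half of the rows (500 of 1000) remain. A minimal sketch of how `~` and `|` combine, on a made-up Series, with an equivalent between() call for comparison (equivalent only when there are no NaN values):

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)

below = prices < q1
above = prices > q3

# | is element-wise OR, ~ negates the mask: keep what is neither below nor above
kept = prices[~(below | above)]

# The same selection with between(), which includes both ends by default
kept_between = prices[prices.between(q1, q3)]

print(kept)
```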

Posted 2022-03-24 Updated 2022-03-24 minkuen python / library / pandas 2 hours read (About 13518 words)

pandas_10minutes

Pandas in 10 Minutes

https://dataitgirls2.github.io/10minutes2pandas/

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1. Object Creation

Create a Series by passing a list of values; pandas assigns an integer index by default.

# Create an object with Series
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a DataFrame by passing a NumPy array, together with a datetime index and labeled columns.

# Use date_range() to generate 6 consecutive days starting from 2013-01-01.
dates = pd.date_range('20130101', periods = 6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

# Create the DataFrame
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

	A	B	C	D
2013-01-01	-0.214371	-0.489334	0.807876	-2.328570
2013-01-02	-0.018762	-0.438046	0.593880	0.671849
2013-01-03	-0.596207	0.081615	0.182117	-2.063007
2013-01-04	-2.044753	-0.853425	1.582471	-0.756233
2013-01-05	0.394973	-0.526762	0.393856	1.550660
2013-01-06	-1.665879	0.184903	1.905710	2.345500

Create a DataFrame by passing a dict of objects that can be converted to something Series-like.

df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : np.array([3] * 4,dtype = 'int32'),
                    'E' : pd.Categorical(["test", "train", "test", "train"]),
                    'F' : 'foo'})

df2

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

The columns of the resulting DataFrame have different data types (dtypes).

df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

2. Viewing Data

To check the top and bottom rows of a DataFrame, use the following.

df.tail(3) # print the last 3 rows
df.tail() # print the last 5 rows; the default is 5

	A	B	C	D
2013-01-02	-0.018762	-0.438046	0.593880	0.671849
2013-01-03	-0.596207	0.081615	0.182117	-2.063007
2013-01-04	-2.044753	-0.853425	1.582471	-0.756233
2013-01-05	0.394973	-0.526762	0.393856	1.550660
2013-01-06	-1.665879	0.184903	1.905710	2.345500

df.head() # print the first 5 rows

	A	B	C	D
2013-01-01	-0.214371	-0.489334	0.807876	-2.328570
2013-01-02	-0.018762	-0.438046	0.593880	0.671849
2013-01-03	-0.596207	0.081615	0.182117	-2.063007
2013-01-04	-2.044753	-0.853425	1.582471	-0.756233
2013-01-05	0.394973	-0.526762	0.393856	1.550660

Display the index, the columns, and the underlying NumPy data.

df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

df.values

array([[-0.21437119, -0.48933404,  0.80787593, -2.32856993],
       [-0.01876194, -0.43804563,  0.59387975,  0.67184854],
       [-0.59620717,  0.08161493,  0.18211706, -2.06300731],
       [-2.0447528 , -0.85342539,  1.58247067, -0.75623263],
       [ 0.39497306, -0.52676189,  0.39385602,  1.55066002],
       [-1.66587853,  0.18490331,  1.9057098 ,  2.34549952]])
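
As far as I know, the pandas documentation recommends DataFrame.to_numpy() over .values for getting the underlying array. A minimal sketch on a made-up `frame`:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame
frame = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])

arr = frame.to_numpy()   # same data as frame.values, returned as a NumPy array
print(type(arr), arr.shape)
print(arr)
```

The index and column labels are not part of the resulting array.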

describe() shows a quick statistical summary of the data.

df.describe()

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	-0.690833	-0.340175	0.910985	-0.096634
std	0.964410	0.395899	0.685599	1.926208
min	-2.044753	-0.853425	0.182117	-2.328570
25%	-1.398461	-0.517405	0.443862	-1.736314
50%	-0.405289	-0.463690	0.700878	-0.042192
75%	-0.067664	-0.048300	1.388822	1.330957
max	0.394973	0.184903	1.905710	2.345500

Transpose the data.
That is, swap the two axes.

df.T

	2013-01-01	2013-01-02	2013-01-03	2013-01-04	2013-01-05	2013-01-06
A	-0.214371	-0.018762	-0.596207	-2.044753	0.394973	-1.665879
B	-0.489334	-0.438046	0.081615	-0.853425	-0.526762	0.184903
C	0.807876	0.593880	0.182117	1.582471	0.393856	1.905710
D	-2.328570	0.671849	-2.063007	-0.756233	1.550660	2.345500

Sort by an axis.

df.sort_index(axis=1, ascending=False)
# With axis=1, the sort is applied to the column labels.

	D	C	B	A
2013-01-01	-2.328570	0.807876	-0.489334	-0.214371
2013-01-02	0.671849	0.593880	-0.438046	-0.018762
2013-01-03	-2.063007	0.182117	0.081615	-0.596207
2013-01-04	-0.756233	1.582471	-0.853425	-2.044753
2013-01-05	1.550660	0.393856	-0.526762	0.394973
2013-01-06	2.345500	1.905710	0.184903	-1.665879
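
To make the `axis` argument concrete, here is a small sketch on a made-up `frame` whose row and column labels are both out of order:

```python
import pandas as pd

# Hypothetical frame with unsorted row and column labels
frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["y", "x"],
    columns=["B", "C", "A"],
)

by_rows = frame.sort_index(axis=0)                   # sorts the row labels
by_cols = frame.sort_index(axis=1, ascending=False)  # sorts the column labels, descending

print(by_rows)
print(by_cols)
```

Only the labels decide the order; the values simply move along with them.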

Sort by values.

df.sort_values(by='B')

	A	B	C	D
2013-01-04	-2.044753	-0.853425	1.582471	-0.756233
2013-01-05	0.394973	-0.526762	0.393856	1.550660
2013-01-01	-0.214371	-0.489334	0.807876	-2.328570
2013-01-02	-0.018762	-0.438046	0.593880	0.671849
2013-01-03	-0.596207	0.081615	0.182117	-2.063007
2013-01-06	-1.665879	0.184903	1.905710	2.345500
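
sort_values also accepts a descending order and several sort keys. A minimal sketch with a made-up `frame` (the column names `group` and `value` are assumptions):

```python
import pandas as pd

# Hypothetical data
frame = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "value": [1, 4, 3, 2],
})

# Largest value first
desc = frame.sort_values(by="value", ascending=False)

# Sort by group ascending, then by value descending within each group
multi = frame.sort_values(by=["group", "value"], ascending=[True, False])

print(desc)
print(multi)
```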

3. Selection

Note: Standard Python / NumPy expressions for selecting and setting are intuitive and handy for interactive work, but we recommend the data access methods optimized for pandas: .at, .iat, .loc, and .iloc.
Getting
- Select a single column, which yields a Series; equivalent to df.A

df['A']

2013-01-01   -0.214371
2013-01-02   -0.018762
2013-01-03   -0.596207
2013-01-04   -2.044753
2013-01-05    0.394973
2013-01-06   -1.665879
Freq: D, Name: A, dtype: float64

Selecting via [ ] with a slice slices the rows.

df[0:3]

	A	B	C	D
2013-01-01	0.654753	-0.366034	-1.440226	-1.043957
2013-01-02	1.589167	0.321939	1.393342	0.898153
2013-01-03	0.270879	0.107423	-2.032053	1.861947

df['20130102':'20130104']

	A	B	C	D
2013-01-02	-0.018762	-0.438046	0.593880	0.671849
2013-01-03	-0.596207	0.081615	0.182117	-2.063007
2013-01-04	-2.044753	-0.853425	1.582471	-0.756233


Selection by Label

Get a cross section using a label.

df.loc[dates[0]]

A   -0.214371
B   -0.489334
C    0.807876
D   -2.328570
Name: 2013-01-01 00:00:00, dtype: float64

Select on multiple axes by label.

df.loc[:,['A','B']]

	A	B
2013-01-01	-0.214371	-0.489334
2013-01-02	-0.018762	-0.438046
2013-01-03	-0.596207	0.081615
2013-01-04	-2.044753	-0.853425
2013-01-05	0.394973	-0.526762
2013-01-06	-1.665879	0.184903


Label slicing includes both endpoints.

df.loc['20130102':'20130104', ['A','B']]

	A	B
2013-01-02	-0.018762	-0.438046
2013-01-03	-0.596207	0.081615
2013-01-04	-2.044753	-0.853425
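
The endpoint behavior is easy to mix up with positional slicing, so here is a minimal sketch that compares the two. The frame is a hypothetical stand-in for the post's df (same dates and column names, but made-up integer values).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the post's df: six daily rows, four columns.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based slice: both endpoints (Jan 2 and Jan 4) are included.
by_label = df.loc["20130102":"20130104", ["A", "B"]]

# Position-based slice: the stop position (4) is excluded.
by_position = df.iloc[1:4, 0:2]

print(len(by_label))                 # 3 rows: Jan 2, Jan 3, Jan 4
print(by_label.equals(by_position))  # True
```

To get the same three rows by position, the stop has to be one past the last row you want.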


Reduce the dimensions of the returned object.

df.loc['20130102',['A','B']]

A   -0.018762
B   -0.438046
Name: 2013-01-02 00:00:00, dtype: float64

Get a scalar value.

df.loc[dates[0], 'A']

-0.21437119207750993

A faster way to get a scalar value (equivalent to the previous method).

df.at[dates[0], 'A']

-0.21437119207750993

Selection by Position

Select by the position of the passed integer.

df.iloc[3]

A   -2.044753
B   -0.853425
C    1.582471
D   -0.756233
Name: 2013-01-04 00:00:00, dtype: float64

Integer slices behave like NumPy / Python slices.

df.iloc[3:5, 0:2]

	A	B
2013-01-04	-2.044753	-0.853425
2013-01-05	0.394973	-0.526762


Lists of integer positions also work, similar to NumPy / Python style.

df.iloc[[1, 2, 4], [0, 2]]

	A	C
2013-01-02	-0.018762	0.593880
2013-01-03	-0.596207	0.182117
2013-01-05	0.394973	0.393856


To slice rows explicitly:
- that is, leave the column side as a bare colon.

df.iloc[1:3, : ]

	A	B	C	D
2013-01-02	-0.018762	-0.438046	0.593880	0.671849
2013-01-03	-0.596207	0.081615	0.182117	-2.063007


To slice columns explicitly:
- that is, leave the row side as a bare colon.

df.iloc[ : , 1:3]

	B	C
2013-01-01	-0.489334	0.807876
2013-01-02	-0.438046	0.593880
2013-01-03	0.081615	0.182117
2013-01-04	-0.853425	1.582471
2013-01-05	-0.526762	0.393856
2013-01-06	0.184903	1.905710


To get a specific value explicitly:

df.iloc[1, 1]

-0.43804562902186034

A fast way to get a scalar value (equivalent to the method above).

df.iat[1,1]

-0.43804562902186034
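
The four accessors can be lined up against each other. This is only a sketch on a hypothetical frame with made-up values; it shows that .loc / .at (labels) and .iloc / .iat (positions) reach the same cell.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: same dates and columns as the post, made-up values.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Row position 1 is 2013-01-02; column position 1 is "B".
via_loc = df.loc[dates[1], "B"]    # label-based, general purpose
via_at = df.at[dates[1], "B"]      # label-based, scalar only
via_iloc = df.iloc[1, 1]           # position-based, general purpose
via_iat = df.iat[1, 1]             # position-based, scalar only

print(via_loc, via_at, via_iloc, via_iat)  # 5.0 5.0 5.0 5.0
```

.at and .iat only accept a single row and a single column, which is why they can skip the extra work that .loc and .iloc do to support slices and lists.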

Boolean Indexing

Use a single column's values to select data.

df[df.A > 0]

	A	B	C	D
2013-01-05	0.394973	-0.526762	0.393856	1.55066


Select values from the DataFrame where a Boolean condition is met.

df[df > 0]

	A	B	C	D
2013-01-01	NaN	NaN	0.807876	NaN
2013-01-02	NaN	NaN	0.593880	0.671849
2013-01-03	NaN	0.081615	0.182117	NaN
2013-01-04	NaN	NaN	1.582471	NaN
2013-01-05	0.394973	NaN	0.393856	1.550660
2013-01-06	NaN	0.184903	1.905710	2.345500
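
The two forms of Boolean indexing do different things: a Boolean Series filters rows, while a Boolean DataFrame masks individual cells. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data chosen so the result is easy to check by eye.
df = pd.DataFrame({"A": [-1.0, 2.0, -3.0], "B": [4.0, -5.0, 6.0]})

# Row filter: keep only rows where column A is positive.
rows = df[df["A"] > 0]

# Element-wise mask: same shape as df, NaN where the condition is False.
masked = df[df > 0]

print(rows.index.tolist())              # [1]
print(masked.shape == df.shape)         # True
print(masked.equals(df.where(df > 0)))  # True
print(int(masked.isna().sum().sum()))   # 3
```

df[df > 0] is shorthand for df.where(df > 0), so the shape never changes; only the cells that fail the condition become NaN.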


isin

A method for filtering.

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

df2[df2['E'].isin(['two','four'])]
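
Since no output is shown for isin, here is a self-contained sketch with a hypothetical one-column frame. The label column has to be assigned as df2['E']; assigning the list to df2 itself would replace the whole frame with a plain list.

```python
import pandas as pd

# Hypothetical frame; only the "E" labels mirror the post.
df2 = pd.DataFrame({"A": range(6)})
df2["E"] = ["one", "one", "two", "three", "four", "three"]

# isin returns a Boolean Series; indexing with it keeps the matching rows.
mask = df2["E"].isin(["two", "four"])
filtered = df2[mask]

print(filtered["E"].tolist())   # ['two', 'four']
print(filtered.index.tolist())  # [2, 4]
```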

Setting

Setting a new column automatically aligns the data by index.

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

df['F'] = s1

Set a value by label.

df.at[dates[0], 'A'] = 0

Set a value by position.

df.iat[0, 1] = 0

Set values by assigning a NumPy array.

df.loc[:, 'D'] = np.array([5] * len(df))

The result of the preceding setting operations:

df

	A	B	C	D	F
2013-01-01	0.000000	0.000000	0.807876	5	NaN
2013-01-02	-0.018762	-0.438046	0.593880	5	1.0
2013-01-03	-0.596207	0.081615	0.182117	5	2.0
2013-01-04	-2.044753	-0.853425	1.582471	5	3.0
2013-01-05	0.394973	-0.526762	0.393856	5	4.0
2013-01-06	-1.665879	0.184903	1.905710	5	5.0
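
The NaN in the first row of F comes from index alignment: s1 starts on 2013-01-02, one day after df. A minimal sketch of just that step, using a hypothetical one-column frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)          # Jan 1 .. Jan 6
df = pd.DataFrame({"A": np.zeros(6)}, index=dates)    # hypothetical values

# s1 starts one day later, so it covers Jan 2 .. Jan 7.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1

print(df["F"].tolist())  # [nan, 1.0, 2.0, 3.0, 4.0, 5.0]
print(len(df))           # 6: the Jan 7 value is dropped, no row is added
```

Values are matched by label, not by position, so Jan 1 has nothing to take from s1 and the Jan 7 entry has nowhere to go.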


A where operation with setting.

df2 = df.copy()
df2[df2 > 0] = -df2
df2

	A	B	C	D	F
2013-01-01	0.000000	0.000000	-0.807876	-5	NaN
2013-01-02	-0.018762	-0.438046	-0.593880	-5	-1.0
2013-01-03	-0.596207	-0.081615	-0.182117	-5	-2.0
2013-01-04	-2.044753	-0.853425	-1.582471	-5	-3.0
2013-01-05	-0.394973	-0.526762	-0.393856	-5	-4.0
2013-01-06	-1.665879	-0.184903	-1.905710	-5	-5.0
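
What this assignment does is easier to see on a tiny frame. The sketch below uses hypothetical values: only the cells selected by the mask are overwritten, each with the value at the same position in -df2.

```python
import pandas as pd

# Hypothetical 2x2 frame with mixed signs.
df = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

df2 = df.copy()
df2[df2 > 0] = -df2   # flip only the positive entries

print(df2.values.tolist())       # [[-1.0, -3.0], [-2.0, -4.0]]
print(df2.equals(-df.abs()))     # True
print(df.values.tolist())        # [[-1.0, 3.0], [2.0, -4.0]] (original untouched)
```

Working on a copy is what keeps df unchanged here.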


4.Missing Data

pandas primarily uses the value np.nan to represent missing data.
- By default, it is not included in computations.
Reindexing lets you change, add, or delete the index on a specified axis. Reindexing returns a copy of the data.

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1

	A	B	C	D	F	E
2013-01-01	0.000000	0.000000	0.807876	5	NaN	1.0
2013-01-02	-0.018762	-0.438046	0.593880	5	1.0	1.0
2013-01-03	-0.596207	0.081615	0.182117	5	2.0	NaN
2013-01-04	-2.044753	-0.853425	1.582471	5	3.0	NaN
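
A short sketch of the "returns a copy" point, on a hypothetical one-column frame: the reindexed frame gets the new column and the shorter index, and the original keeps its shape.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": np.arange(6.0)}, index=dates)  # hypothetical values

# Keep the first four rows and add a new, empty column "E".
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1

print(df1["E"].tolist())    # [1.0, 1.0, nan, nan]
print(df.shape, df1.shape)  # (6, 1) (4, 2)
print("E" in df.columns)    # False: the original is untouched
```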


Drop any rows that have missing data.

df1.dropna(how = 'any')

	A	B	C	D	F	E
2013-01-02	-0.018762	-0.438046	0.59388	5	1.0	1.0


Fill in missing data.

df1.fillna(value=5)

	A	B	C	D	F	E
2013-01-01	0.000000	0.000000	0.807876	5	5.0	1.0
2013-01-02	-0.018762	-0.438046	0.593880	5	1.0	1.0
2013-01-03	-0.596207	0.081615	0.182117	5	2.0	5.0
2013-01-04	-2.044753	-0.853425	1.582471	5	3.0	5.0
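
Both dropna and fillna return new objects, and the how argument changes which rows dropna removes. A sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has one NaN, row 1 none, row 2 is all NaN.
df1 = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [np.nan, 5.0, np.nan]})

any_dropped = df1.dropna(how="any")  # drop rows with at least one NaN
all_dropped = df1.dropna(how="all")  # drop rows where every value is NaN

print(any_dropped.index.tolist())  # [1]
print(all_dropped.index.tolist())  # [0, 1]

# fillna returns a new object; df1 still holds its three NaN values.
filled = df1.fillna(value=5)
print(int(filled.isna().sum().sum()), int(df1.isna().sum().sum()))  # 0 3
```

Assign the result back (for example df1 = df1.dropna(how='any')) if the change should persist.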


Get a Boolean mask of where the values are NaN.
Every value in the DataFrame is shown as a Boolean, with True only where the value is NaN.

pd.isna(df1)

	A	B	C	D	F	E
2013-01-01	False	False	False	False	True	False
2013-01-02	False	False	False	False	False	False
2013-01-03	False	False	False	False	False	True
2013-01-04	False	False	False	False	False	True


5.Operation

Stats

Operations generally exclude missing data.
Compute a descriptive statistic:

df.mean()

A   -0.655105
B   -0.258619
C    0.910985
D    5.000000
F    3.000000
dtype: float64

Perform the same operation on the other axis.

df.mean(axis=1)

2013-01-01    1.451969
2013-01-02    1.227414
2013-01-03    1.333505
2013-01-04    1.336858
2013-01-05    1.852413
2013-01-06    2.084947
Freq: D, dtype: float64
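
To see what "exclude missing data" means in practice, here is a sketch with a hypothetical two-column frame. NaN is skipped by default; passing skipna=False makes it propagate instead.

```python
import numpy as np
import pandas as pd

# Hypothetical values; "F" has one missing entry.
df = pd.DataFrame({"D": [5.0, 5.0, 5.0], "F": [np.nan, 1.0, 2.0]})

print(df.mean().tolist())          # [5.0, 1.5]  (F averages only 1.0 and 2.0)
print(df.mean(axis=1).tolist())    # [5.0, 3.0, 3.5]
print(df["F"].mean(skipna=False))  # nan
```

The first row mean is 5.0 rather than NaN because only the single non-missing value in that row is averaged.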

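Operating on objects with different dimensionality requires alignment.
pandas automatically broadcasts along the specified dimension.
- Broadcasting refers to the rules that determine the result when operating on arrays of different dimensions or on an array and a scalar.

s = pd.Series([1, 3, 4, np.nan, 6, 8], index=dates)
s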

2013-01-01    1.0
2013-01-02    3.0
2013-01-03    4.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64

The same Series shifted by two positions with shift:

s = pd.Series([1, 3, 4, np.nan, 6, 8], index=dates).shift(2)
s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    4.0
2013-01-06    NaN
Freq: D, dtype: float64

Subtract along the index axis.

df.sub(s, axis='index')

	A	B	C	D	F
2013-01-01	NaN	NaN	NaN	NaN	NaN
2013-01-02	NaN	NaN	NaN	NaN	NaN
2013-01-03	-1.596207	-0.918385	-0.817883	4.0	1.0
2013-01-04	-5.044753	-3.853425	-1.417529	2.0	0.0
2013-01-05	-3.605027	-4.526762	-3.606144	1.0	0.0
2013-01-06	NaN	NaN	NaN	NaN	NaN
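
The shift-then-subtract step can be traced on smaller numbers. This sketch uses hypothetical values: each entry of s is matched to the row with the same label and subtracted from every column in that row.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0, 40.0], "B": [1.0, 2.0, 3.0, 4.0]},  # hypothetical
    index=dates,
)

# shift(2) moves values down two rows and fills the top with NaN.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates).shift(2)
print(s.isna().tolist())  # [True, True, False, False]

# Row labels are aligned, then s is broadcast across the columns.
result = df.sub(s, axis="index")
print(result["A"].iloc[2:].tolist())  # [29.0, 38.0]
print(result["B"].iloc[2:].tolist())  # [2.0, 2.0]
print(result.iloc[:2].isna().all().all())  # True
```

Any row where s is NaN comes out as NaN in every column, which is why the first two rows (and the row of the original NaN) are empty in the table above.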


Apply

Apply a function to the data.

df.apply(np.cumsum)

	A	B	C	D	F
2013-01-01	0.000000	0.000000	0.807876	5	NaN
2013-01-02	-0.018762	-0.438046	1.401756	10	1.0
2013-01-03	-0.614969	-0.356431	1.583873	15	3.0
2013-01-04	-2.659722	-1.209856	3.166343	20	6.0
2013-01-05	-2.264749	-1.736618	3.560199	25	10.0
2013-01-06	-3.930627	-1.551715	5.465909	30	15.0


df.apply(lambda x: x.max() - x.min())

A    2.439726
B    1.038329
C    1.723593
D    0.000000
F    4.000000
dtype: float64
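
By default apply hands the function one column at a time; axis=1 hands it one row at a time. A sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical values chosen so the ranges are easy to verify.
df = pd.DataFrame({"A": [1.0, 4.0, 2.0], "B": [10.0, 30.0, 20.0]})

# Default axis=0: the lambda receives each column as a Series.
col_range = df.apply(lambda x: x.max() - x.min())
print(col_range.tolist())  # [3.0, 20.0]

# axis=1: the lambda receives each row as a Series.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)
print(row_range.tolist())  # [9.0, 26.0, 18.0]
```

The result is indexed by column names in the first case and by row labels in the second.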

Histogramming

s = pd.Series(np.random.randint(0, 7, size=10))  # random integers from 0 to 6
s

0    1
1    5
2    0
3    1
4    6
5    6
6    5
7    5
8    5
9    1
dtype: int64

s.value_counts()

5    4
1    3
6    2
0    1
dtype: int64
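
Because the Series above is random, rerunning the cell gives different counts. The sketch below fixes the values to the ten shown in the output so the result is reproducible.

```python
import pandas as pd

# Fixed values copied from the output above instead of random ones.
s = pd.Series([1, 5, 0, 1, 6, 6, 5, 5, 5, 1])
counts = s.value_counts()

# The index holds the distinct values, sorted by descending frequency.
print(counts.index.tolist())  # [5, 1, 6, 0]
print(counts.tolist())        # [4, 3, 2, 1]
```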

String Methods

Series comes with a set of string processing methods, as in the code below.
They live in the str attribute and make it easy to operate on each element of the array.
Pattern matching in str generally uses regular expressions by default.

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
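
The regular-expression default is worth one example. This sketch reuses the same Series and filters it with str.contains; the pattern "^[ab]" and the na=False choice are illustrative, not from the post.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# The pattern is treated as a regular expression by default:
# "^[ab]" matches strings that start with a lowercase a or b.
# na=False makes the missing entry count as "no match".
starts_ab = s.str.lower().str.contains("^[ab]", na=False)

print(s[starts_ab].tolist())  # ['A', 'B', 'Aaba', 'Baca']
```

Without na=False the missing entry would stay missing in the mask, and indexing with such a mask raises an error.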

6.Merge

concat

pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality for join / merge-type operations.
Concatenate pandas objects with concat().

df = pd.DataFrame(np.random.randn(10, 4))
df

	0	1	2	3
0	-0.639128	-0.371715	-2.320589	0.932025
1	-1.041656	0.646479	1.551379	-0.353387
2	-0.782444	0.677232	-0.050054	-0.054370
3	-0.418260	0.673768	-1.694420	0.193668
4	0.788359	-0.308937	-0.314680	-0.058661
5	0.457466	-2.021977	0.611340	-0.538168
6	1.355963	1.295236	-0.399497	-0.052334
7	-0.324138	-0.165932	0.290442	0.531520
8	-0.386876	0.217569	0.926404	-0.813724
9	-0.452338	-0.259533	-0.810046	1.186298


# Break it into pieces,
# then join the pieces back together with concat.
pieces = [df[ : 3], df[3 : 7], df[7 : ]]
pd.concat(pieces)

	0	1	2	3
0	-0.639128	-0.371715	-2.320589	0.932025
1	-1.041656	0.646479	1.551379	-0.353387
2	-0.782444	0.677232	-0.050054	-0.054370
3	-0.418260	0.673768	-1.694420	0.193668
4	0.788359	-0.308937	-0.314680	-0.058661
5	0.457466	-2.021977	0.611340	-0.538168
6	1.355963	1.295236	-0.399497	-0.052334
7	-0.324138	-0.165932	0.290442	0.531520
8	-0.386876	0.217569	0.926404	-0.813724
9	-0.452338	-0.259533	-0.810046	1.186298
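
A quick check that the round trip is lossless, using a hypothetical frame of consecutive integers instead of random numbers. It also shows what happens to the index when the pieces are concatenated in a different order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(10, 4))  # hypothetical values

# Row slices keep their original index labels.
pieces = [df[:3], df[3:7], df[7:]]
rebuilt = pd.concat(pieces)
print(rebuilt.equals(df))  # True

# Concatenating in another order keeps the old labels too,
# unless ignore_index=True asks for a fresh 0..n-1 index.
reordered = pd.concat([df[7:], df[:3]])
renumbered = pd.concat([df[7:], df[:3]], ignore_index=True)
print(reordered.index.tolist())   # [7, 8, 9, 0, 1, 2]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]
```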


Join

SQL-style merges.

left = pd.DataFrame({'key' : ['foo', 'foo'], 'lval' : [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left

	key	lval
0	foo	1
1	foo	2


right

	key	rval
0	foo	4
1	foo	5


Merge the two frames on key.

pd.merge(left, right, on = 'key')

	key	lval	rval
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5
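
The four-row result comes from the duplicated key: every matching left row is paired with every matching right row. The sketch below contrasts that with unique keys; the left_u / right_u frames are hypothetical additions.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

# Duplicate keys: 2 left rows x 2 right rows = 4 output rows.
dup = pd.merge(left, right, on="key")
print(len(dup))  # 4

# Unique keys (hypothetical frames): one output row per key.
left_u = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_u = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
uniq = pd.merge(left_u, right_u, on="key")
print(uniq.values.tolist())  # [['foo', 1, 4], ['bar', 2, 5]]
```

If a key column is expected to be unique, an unexpectedly large merge result is often the first sign that it is not.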


Append

Append rows to a DataFrame.

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

	A	B	C	D
0	-0.909696	0.146335	-0.568276	-0.434510
1	-0.802681	0.235747	-0.751299	-0.053560
2	2.005541	-1.265754	-1.152046	-0.081151
3	-0.422940	-0.095189	-1.634583	0.180732
4	-1.535375	-0.594391	-1.102247	0.047852
5	0.369960	-0.902356	-1.196501	-0.109521
6	-1.369044	-2.044557	-0.487275	0.267463
7	0.439153	0.003023	-1.716505	-2.119485

Copy the row at index 3 and append it as a new row at index 8.

s = df.iloc[3]
df.append(s, ignore_index=True)

	A	B	C	D
0	-0.909696	0.146335	-0.568276	-0.434510
1	-0.802681	0.235747	-0.751299	-0.053560
2	2.005541	-1.265754	-1.152046	-0.081151
3	-0.422940	-0.095189	-1.634583	0.180732
4	-1.535375	-0.594391	-1.102247	0.047852
5	0.369960	-0.902356	-1.196501	-0.109521
6	-1.369044	-2.044557	-0.487275	0.267463
7	0.439153	0.003023	-1.716505	-2.119485
8	-0.422940	-0.095189	-1.634583	0.180732
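
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above only runs on older versions such as the 1.3.5 used in this notebook. A hedged sketch of the same operation with pd.concat, using a small fixed frame instead of random data (the frame contents are illustrative, not from the original):

```python
import pandas as pd

# Illustrative data; the original uses np.random.randn(8, 4).
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [10.0, 20.0, 30.0, 40.0]})

# Take the row at position 3 as a Series, turn it back into a one-row frame,
# and concatenate it; ignore_index=True renumbers the rows from 0.
s = df.iloc[3]
appended = pd.concat([df, s.to_frame().T], ignore_index=True)
print(appended)
```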

7.Grouping

Grouping refers to a process that involves one or more of the following steps.
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a single data structure
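
The three steps can be written out by hand to see what groupby does in a single call. This is only a sketch on a tiny made-up frame (the values are illustrative), not how pandas implements it internally:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Splitting: one sub-frame per distinct value of column A.
groups = {key: group for key, group in df.groupby('A')}

# Applying: compute the sum of C independently for each group.
sums = {key: group['C'].sum() for key, group in groups.items()}

# Combining: collect the per-group results into a single Series.
manual = pd.Series(sums, name='C')

# groupby performs the same three steps in one call.
builtin = df.groupby('A')['C'].sum()
print(manual)
print(builtin)
```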

df = pd.DataFrame(
    {
        'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C' : np.random.randn(8),
        'D' : np.random.randn(8)
    })

df

	A	B	C	D
0	foo	one	0.144726	0.653074
1	bar	one	-0.590503	0.128616
2	foo	two	1.816665	-1.533646
3	bar	three	-1.574489	-0.140956
4	foo	two	0.103910	1.448011
5	bar	two	-0.610817	0.742873
6	foo	one	-1.576850	0.444138
7	foo	three	0.857080	0.157513

Group the DataFrame, then apply the sum() function to each group.

df.groupby('A').sum()

	C	D
A
bar	-2.775808	0.730534
foo	1.345531	1.169089

Grouping by multiple columns forms a hierarchical index, and sum() can be applied here as well.

df.groupby(['A', 'B']).sum()

		C	D
A	B
bar	one	-0.590503	0.128616
	three	-1.574489	-0.140956
	two	-0.610817	0.742873
foo	one	-1.432124	1.097212
	three	0.857080	0.157513
	two	1.920575	-0.085635

8.Reshaping

Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

		A	B
first	second
bar	one	-0.827364	-1.346867
bar	two	-1.197194	-0.118960
baz	one	-1.071918	0.825303
baz	two	0.507340	-1.517231

The stack() method "compresses" a level in the DataFrame's columns.

stacked = df2.stack()
stacked

first  second   
bar    one     A   -0.827364
               B   -1.346867
       two     A   -1.197194
               B   -0.118960
baz    one     A   -1.071918
               B    0.825303
       two     A    0.507340
               B   -1.517231
dtype: float64

For a "stacked" DataFrame or Series (one with a MultiIndex as its index), the inverse operation of stack() is unstack(), which unstacks the last level by default.

stacked.unstack()

		A	B
first	second
bar	one	-0.827364	-1.346867
bar	two	-1.197194	-0.118960
baz	one	-1.071918	0.825303
baz	two	0.507340	-1.517231

stacked.unstack(1)

	second	one	two
first
bar	A	-0.827364	-1.197194
bar	B	-1.346867	-0.118960
baz	A	-1.071918	0.507340
baz	B	0.825303	-1.517231

stacked.unstack(0)

	first	bar	baz
second
one	A	-0.827364	-1.071918
one	B	-1.346867	0.825303
two	A	-1.197194	0.507340
two	B	-0.118960	-1.517231
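
The relationship between stack() and unstack() is easier to check with fixed numbers than with random ones. A small sketch, assuming a 4 x 2 frame with the same index layout as df2 (the values 0..7 are illustrative):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
df2 = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=['A', 'B'])

# stack() moves the column labels into a new innermost index level.
stacked = df2.stack()

# unstack() with no argument moves that innermost level back to the columns.
restored = stacked.unstack()

# A level can also be chosen by name instead of by position.
by_second = stacked.unstack('second')
print(by_second)
```

Selecting the level by name ('second') rather than by number tends to be easier to read when an index has several levels.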

Pivot Tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})

df

	A	B	C	D	E
0	one	A	foo	-0.548983	0.943447
1	one	B	foo	-0.400173	1.836891
2	two	C	foo	0.995067	0.029331
3	three	A	bar	-0.693458	0.457755
4	one	B	bar	0.786452	-0.665671
5	one	C	bar	-0.686570	-1.718177
6	two	A	foo	0.338070	0.163933
7	three	B	foo	1.793455	-0.410172
8	one	C	foo	-0.271664	-0.857467
9	one	A	bar	0.979950	-1.324755
10	two	B	bar	-0.689860	0.907164
11	three	C	bar	-1.210862	-0.276602

A pivot table can be created from this data very easily.

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

	C	bar	foo
A	B
one	A	0.979950	-0.548983
	B	0.786452	-0.400173
	C	-0.686570	-0.271664
three	A	-0.693458	NaN
	B	NaN	1.793455
	C	-1.210862	NaN
two	A	NaN	0.338070
	B	-0.689860	NaN
	C	NaN	0.995067
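
Two details of the output above are worth spelling out: pivot_table aggregates with the mean by default, and index/column combinations that never occur in the data show up as NaN. A minimal sketch on made-up values (the frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['foo', 'bar', 'foo', 'foo'],
                   'D': [1.0, 2.0, 3.0, 5.0]})

# Rows come from A, columns from C, cell values from D.
# ('two', 'foo') occurs twice, so its cell holds the mean of 3.0 and 5.0.
# ('two', 'bar') never occurs, so its cell is NaN.
table = pd.pivot_table(df, values='D', index='A', columns='C')
print(table)
```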

9.Time Series

pandas provides simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting per-second data into 5-minute data).
This is very common in financial applications, but not limited to them.

rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()

2012-01-01    23654
Freq: 5T, dtype: int64
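
In the run above all 100 seconds fall into a single 5-minute bin, so only one row comes back. A sketch with 600 seconds of constant data makes the binning visible. The lowercase aliases 's' and '5min' are used here on the assumption that they are accepted by both older and newer pandas releases:

```python
import pandas as pd

# Ten minutes of per-second data, each observation equal to 1.
rng = pd.date_range('2012-01-01', periods=600, freq='s')
ts = pd.Series(1, index=rng)

# Each 5-minute bin sums the 300 seconds that fall inside it.
five_min = ts.resample('5min').sum()
print(five_min)
```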

Time zone representation:

rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2012-03-06   -0.480140
2012-03-07   -0.904772
2012-03-08    0.386809
2012-03-09    0.873791
2012-03-10    0.478778
Freq: D, dtype: float64

ts_utc = ts.tz_localize('UTC')
ts_utc

2012-03-06 00:00:00+00:00   -0.480140
2012-03-07 00:00:00+00:00   -0.904772
2012-03-08 00:00:00+00:00    0.386809
2012-03-09 00:00:00+00:00    0.873791
2012-03-10 00:00:00+00:00    0.478778
Freq: D, dtype: float64

Convert to another time zone.

ts_utc.tz_convert('US/Eastern')

2012-03-05 19:00:00-05:00   -0.480140
2012-03-06 19:00:00-05:00   -0.904772
2012-03-07 19:00:00-05:00    0.386809
2012-03-08 19:00:00-05:00    0.873791
2012-03-09 19:00:00-05:00    0.478778
Freq: D, dtype: float64
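
The difference between the two calls is that tz_localize attaches a zone to naive timestamps, while tz_convert re-expresses the same instants in another zone. A short sketch with fixed values (the two-element series is illustrative):

```python
import pandas as pd

rng = pd.date_range('2012-03-06 00:00', periods=2, freq='D')
ts = pd.Series([1.0, 2.0], index=rng)

# tz_localize labels the naive timestamps as UTC without shifting the clock time.
ts_utc = ts.tz_localize('UTC')

# tz_convert keeps the same instants but displays them in US/Eastern.
ts_eastern = ts_utc.tz_convert('US/Eastern')
print(ts_eastern)
```

The index entries of ts_utc and ts_eastern compare equal, because they denote the same moments in time.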

Convert between timestamp and period representations.

rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2012-01-31   -0.914418
2012-02-29   -0.077113
2012-03-31   -0.006825
2012-04-30    0.007167
2012-05-31   -0.733946
Freq: M, dtype: float64

ps = ts.to_period()
ps

2012-01   -0.914418
2012-02   -0.077113
2012-03   -0.006825
2012-04    0.007167
2012-05   -0.733946
Freq: M, dtype: float64

ps.to_timestamp()

2012-01-01   -0.914418
2012-02-01   -0.077113
2012-03-01   -0.006825
2012-04-01    0.007167
2012-05-01   -0.733946
Freq: MS, dtype: float64
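
Note that the round trip does not return the original month-end dates: to_timestamp() uses the start of each period by default. A sketch with the dates written out explicitly (the values are illustrative):

```python
import pandas as pd

# Month-end timestamps, written out explicitly.
idx = pd.to_datetime(['2012-01-31', '2012-02-29', '2012-03-31'])
ts = pd.Series([1.0, 2.0, 3.0], index=idx)

ps = ts.to_period('M')      # each timestamp becomes the month that contains it
back = ps.to_timestamp()    # each period becomes its first day by default
print(back)
```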

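Converting between period and timestamp makes some convenient arithmetic functions available.
The following example converts a quarterly frequency whose fiscal year ends in November to 9 a.m. on the first day of the month following the quarter end.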

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

1990-03-01 09:00   -0.685539
1990-06-01 09:00   -1.076153
1990-09-01 09:00    0.737103
1990-12-01 09:00   -1.115201
1991-03-01 09:00    0.680304
Freq: H, dtype: float64

10.Categoricals

pandas can include categorical data in a DataFrame.

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

Convert the raw grades to a categorical data type.

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

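Rename the categories to more meaningful names (assigning to Series.cat.categories is the appropriate way).
Note that set_categories() returns a new Series; its result is not assigned back here, so the categories in the output below are unchanged.

df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]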

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

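Sorting follows the order defined by the categories, not lexical order.
Had the categories been replaced with very bad, bad, medium, good, very good, the rows would be sorted in that order; here the categories are still a, b, e, so the result follows that order.

df.sort_values(by="grade")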

	id	raw_grade	grade
0	1	a	a
3	4	a	a
4	5	a	a
1	2	b	b
2	3	b	b
5	6	e	e

Grouping by a categorical column also shows empty categories (there are none here, since the categories are still a, b, e).

df.groupby("grade").size()

grade
a    3
b    2
e    1
dtype: int64
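
For comparison, here is a sketch of the whole sequence with the results assigned back, so that the renamed categories, the category-order sort, and the empty groups all become visible. The mapping of a, b, e to "very good", "good", "very bad" is an assumption chosen for illustration, and rename_categories is used in place of assigning to Series.cat.categories:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")

# Rename the existing categories (assumed mapping), keeping the result.
df["grade"] = df["grade"].cat.rename_categories(
    {"a": "very good", "b": "good", "e": "very bad"})

# Reorder and extend the categories; again the result must be assigned back.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

# Sorting now follows the category order, not alphabetical order.
sorted_grades = df.sort_values(by="grade")["grade"].tolist()

# observed=False keeps categories that have no rows ("bad", "medium").
counts = df.groupby("grade", observed=False).size()
print(sorted_grades)
print(counts)
```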

11.Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f177f1ef3d0>

png

On a DataFrame, the plot() method is a convenient way to plot all columns with labels.

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])  

df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

<matplotlib.legend.Legend at 0x7f177ebf3a50>




<Figure size 432x288 with 0 Axes>

png

12.Getting Data In / Out

CSV

Write to a CSV file.

df.to_csv('foo.csv')

Read from a CSV file.

pd.read_csv('foo.csv')

	Unnamed: 0	A	B	C	D
0	2000-01-01	0.005390	-0.616651	2.261198	-0.868199
1	2000-01-02	-0.084304	-0.247153	0.097660	-0.381440
2	2000-01-03	1.540081	0.806761	0.628394	-0.810376
3	2000-01-04	2.339388	0.573873	2.907442	0.339424
4	2000-01-05	0.938390	2.164131	3.848056	0.158632
...	...	...	...	...	...
995	2002-09-22	-4.732994	38.365117	10.155791	-34.795312
996	2002-09-23	-7.505606	38.661194	10.837375	-35.956062
997	2002-09-24	-4.967844	37.522602	10.977005	-35.639584
998	2002-09-25	-3.707181	35.950703	11.191352	-36.306747
999	2002-09-26	-1.984682	36.604786	10.741370	-35.995049

1000 rows × 5 columns
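
The "Unnamed: 0" column in the output is the date index that to_csv wrote as the first column. A sketch of reading it back as the index, using an in-memory buffer instead of foo.csv so nothing is written to disk (the two-row frame is illustrative):

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0]},
                  index=pd.date_range('2000-01-01', periods=2))

buf = io.StringIO()
df.to_csv(buf)

# Plain read: the saved index comes back as an ordinary column.
buf.seek(0)
plain = pd.read_csv(buf)

# index_col=0 restores it as the index; parse_dates=True parses it as dates.
buf.seek(0)
restored = pd.read_csv(buf, index_col=0, parse_dates=True)
print(restored)
```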

HDF5

Read from and write to HDFStores.

df.to_hdf('foo.h5', 'df')

Read from the HDF5 store.

pd.read_hdf('foo.h5', 'df')

	A	B	C	D
2000-01-01	0.005390	-0.616651	2.261198	-0.868199
2000-01-02	-0.084304	-0.247153	0.097660	-0.381440
2000-01-03	1.540081	0.806761	0.628394	-0.810376
2000-01-04	2.339388	0.573873	2.907442	0.339424
2000-01-05	0.938390	2.164131	3.848056	0.158632
...	...	...	...	...
2002-09-22	-4.732994	38.365117	10.155791	-34.795312
2002-09-23	-7.505606	38.661194	10.837375	-35.956062
2002-09-24	-4.967844	37.522602	10.977005	-35.639584
2002-09-25	-3.707181	35.950703	11.191352	-36.306747
2002-09-26	-1.984682	36.604786	10.741370	-35.995049

1000 rows × 4 columns

Excel

Read from and write to MS Excel.
Write to an Excel file.

df.to_excel('foo.xlsx', sheet_name='Sheet1')

Read from an Excel file.

pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

	Unnamed: 0	A	B	C	D
0	2000-01-01	0.005390	-0.616651	2.261198	-0.868199
1	2000-01-02	-0.084304	-0.247153	0.097660	-0.381440
2	2000-01-03	1.540081	0.806761	0.628394	-0.810376
3	2000-01-04	2.339388	0.573873	2.907442	0.339424
4	2000-01-05	0.938390	2.164131	3.848056	0.158632
...	...	...	...	...	...
995	2002-09-22	-4.732994	38.365117	10.155791	-34.795312
996	2002-09-23	-7.505606	38.661194	10.837375	-35.956062
997	2002-09-24	-4.967844	37.522602	10.977005	-35.639584
998	2002-09-25	-3.707181	35.950703	11.191352	-36.306747
999	2002-09-26	-1.984682	36.604786	10.741370	-35.995049

1000 rows × 5 columns

13.Gotchas

An operation may raise an exception (error) like the following.

if pd.Series([False, True, False]):
    print("I was true")

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-129-5c782b38cd2f> in <module>()
----> 1 if pd.Series([False, True, False]):
      2     print("I was true")


/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
   1536     def __nonzero__(self):
   1537         raise ValueError(
-> 1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1540         )


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

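In such cases, use any(), all(), empty, and so on to state explicitly what you want to test.
The check below avoids the error by testing only that the Series object is not None; it does not look at the values.

if pd.Series([False, True, False]) is not None:
    print("I was not None")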

I was not None
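
The alternatives named in the error message each answer a different question. A short sketch of what they return for the same Series:

```python
import pandas as pd

s = pd.Series([False, True, False])

# Using the Series directly in a boolean context raises ValueError.
try:
    bool(s)
    raised = False
except ValueError:
    raised = True

has_any_true = bool(s.any())   # is at least one element True?
all_true = bool(s.all())       # is every element True?
is_empty = s.empty             # does the Series have no elements?
print(raised, has_any_true, all_true, is_empty)
```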

End of document

Posted 2022-03-23 Updated 2022-03-24 minkuen python / library / pandas 33 minutes read (About 4975 words)

pandas_tutorial_01

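Data preprocessing

The basics of data preprocessing
- Garbage Data --[ Great Model ]--> Garbage Results

Main steps of data preprocessing

Data preprocessing workflow ***(important)***
- 1. Remove duplicates and handle missing values -> 2. Handle outliers -> 3. Feature Engineering
- Main goal: obtain better analysis results and improve model performance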
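
The three stages can be sketched on a tiny frame. The specific choices here (filling with the median, the 1.5 x IQR rule, the derived column name LemonPerLeaflet) are assumptions for illustration; the original only names the stages:

```python
import pandas as pd

# Made-up rows: one exact duplicate, one missing value, one extreme value.
raw = pd.DataFrame({
    'Lemon':    [97, 98, 98, 110, 134, 1000],
    'Leaflets': [90.0, 90.0, 90.0, None, 98.0, 104.0],
})

# 1. Remove duplicate rows, then fill missing values (median is one option).
step1 = raw.drop_duplicates().copy()
step1['Leaflets'] = step1['Leaflets'].fillna(step1['Leaflets'].median())

# 2. Drop outliers with the 1.5 * IQR rule on Lemon (one common heuristic).
q1, q3 = step1['Lemon'].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = step1['Lemon'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
step2 = step1[in_range].copy()

# 3. Feature engineering: derive a new column from existing ones.
step2['LemonPerLeaflet'] = step2['Lemon'] / step2['Leaflets']
print(step2)
```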

pandas basic data types

Both numbers and strings can be stored.
Duplicates are not allowed.

pandas

Import the library

import pandas as pd
print(pd.__version__)

1.3.5

Test

temp_dic = {"col1": [1, 2, 3],
            "col2": [3, 4, 5]}

df = pd.DataFrame(temp_dic)  # use DataFrame()
df

	col1	col2
0	1	3
1	2	4
2	3	5

print(type(df))

<class 'pandas.core.frame.DataFrame'>

temp_dic = {'a':1, 'b':2, 'c':3}
ser = pd.Series(temp_dic)  # use Series()
print(ser)
print(type(ser))

a    1
b    2
c    3
dtype: int64
<class 'pandas.core.series.Series'>
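
The two types are closely related: selecting a single column of a DataFrame returns a Series that shares the DataFrame's index. A brief sketch using the same values as the test frame above:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [3, 4, 5]})

# A single column is a Series with the same index as the frame.
col = df["col1"]
same_index = col.index.equals(df.index)

# A DataFrame can be put back together from its column Series.
rebuilt = pd.DataFrame({"col1": df["col1"], "col2": df["col2"]})
print(type(col), same_index)
```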

Connecting Google Drive

Google Drive → Colab Notebooks → create a new folder named data → upload the lemonade.csv file downloaded from Slack -> run the following code

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

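Since "Mounted at /content/drive" was printed, the mount succeeded.

In the file browser on the left (folder icon) -> drive -> MyDrive -> Colab Notebooks -> data -> find Lemonade2016.csv and right-click it -> copy path -> paste the path into the following code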

DATA_PATH = '/content/drive/MyDrive/Colab Notebooks/data/Lemonade2016.csv'
juice = pd.read_csv(DATA_PATH)
juice

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price
0	7/1/2016	Park	97	67	70	90.0	0.25
1	7/2/2016	Park	98	67	72	90.0	0.25
2	7/3/2016	Park	110	77	71	104.0	0.25
3	7/4/2016	Beach	134	99	76	98.0	0.25
4	7/5/2016	Beach	159	118	78	135.0	0.25
5	7/6/2016	Beach	103	69	82	90.0	0.25
6	7/6/2016	Beach	103	69	82	90.0	0.25
7	7/7/2016	Beach	143	101	81	135.0	0.25
8	NaN	Beach	123	86	82	113.0	0.25
9	7/9/2016	Beach	134	95	80	126.0	0.25
10	7/10/2016	Beach	140	98	82	131.0	0.25
11	7/11/2016	Beach	162	120	83	135.0	0.25
12	7/12/2016	Beach	130	95	84	99.0	0.25
13	7/13/2016	Beach	109	75	77	99.0	0.25
14	7/14/2016	Beach	122	85	78	113.0	0.25
15	7/15/2016	Beach	98	62	75	108.0	0.50
16	7/16/2016	Beach	81	50	74	90.0	0.50
17	7/17/2016	Beach	115	76	77	126.0	0.50
18	7/18/2016	Park	131	92	81	122.0	0.50
19	7/19/2016	Park	122	85	78	113.0	0.50
20	7/20/2016	Park	71	42	70	NaN	0.50
21	7/21/2016	Park	83	50	77	90.0	0.50
22	7/22/2016	Park	112	75	80	108.0	0.50
23	7/23/2016	Park	120	82	81	117.0	0.50
24	7/24/2016	Park	121	82	82	117.0	0.50
25	7/25/2016	Park	156	113	84	135.0	0.50
26	7/26/2016	Park	176	129	83	158.0	0.35
27	7/27/2016	Park	104	68	80	99.0	0.35
28	7/28/2016	Park	96	63	82	90.0	0.35
29	7/29/2016	Park	100	66	81	95.0	0.35
30	7/30/2016	Beach	88	57	82	81.0	0.35
31	7/31/2016	Beach	76	47	82	68.0	0.35

The data is loaded.
The first thing to check:
- the structure of the data

juice.info()  # inspect the data structure

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         31 non-null     object 
 1   Location     32 non-null     object 
 2   Lemon        32 non-null     int64  
 3   Orange       32 non-null     int64  
 4   Temperature  32 non-null     int64  
 5   Leaflets     31 non-null     float64
 6   Price        32 non-null     float64
dtypes: float64(2), int64(3), object(2)
memory usage: 1.9+ KB
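
The info() output above shows 31 non-null values for Date and Leaflets out of 32 rows, so each column has one missing value. A minimal sketch of how those gaps could be counted and located, assuming a small stand-in frame named `sample` (a hypothetical name) built from three rows of the table above:

```python
import numpy as np
import pandas as pd

# Stand-in frame: rows 7, 8 and 20 of the lemonade table shown above.
sample = pd.DataFrame({
    "Date": ["7/7/2016", np.nan, "7/20/2016"],
    "Location": ["Beach", "Beach", "Park"],
    "Leaflets": [135.0, 113.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

The same two calls should work unchanged on the full `juice` frame.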

juice.head()  # returns the first rows; the default is 5

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price
0	7/1/2016	Park	97	67	70	90.0	0.25
1	7/2/2016	Park	98	67	72	90.0	0.25
2	7/3/2016	Park	110	77	71	104.0	0.25
3	7/4/2016	Beach	134	99	76	98.0	0.25
4	7/5/2016	Beach	159	118	78	135.0	0.25

juice.tail()  # returns the last rows; the default is 5

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price
27	7/27/2016	Park	104	68	80	99.0	0.35
28	7/28/2016	Park	96	63	82	90.0	0.35
29	7/29/2016	Park	100	66	81	95.0	0.35
30	7/30/2016	Beach	88	57	82	81.0	0.35
31	7/31/2016	Beach	76	47	82	68.0	0.35

describe() function
- Returns descriptive statistics for the numeric columns

juice.describe()

	Lemon	Orange	Temperature	Leaflets	Price
count	32.000000	32.000000	32.000000	31.000000	32.000000
mean	116.156250	80.000000	78.968750	108.548387	0.354687
std	25.823357	21.863211	4.067847	20.117718	0.113137
min	71.000000	42.000000	70.000000	68.000000	0.250000
25%	98.000000	66.750000	77.000000	90.000000	0.250000
50%	113.500000	76.500000	80.500000	108.000000	0.350000
75%	131.750000	95.000000	82.000000	124.000000	0.500000
max	176.000000	129.000000	84.000000	158.000000	0.500000

type(juice.describe())  # the result of describe() is a DataFrame

pandas.core.frame.DataFrame

value_counts()

print(juice['Location'].value_counts())
print(type(juice['Location'].value_counts()))

Beach    17
Park     15
Name: Location, dtype: int64
<class 'pandas.core.series.Series'>

Working with the data

Let's handle rows and columns.

juice['Sold'] = 0  # add a Sold column
print(juice.head(3))

       Date Location  Lemon  Orange  Temperature  Leaflets  Price  Sold
0  7/1/2016     Park     97      67           70      90.0   0.25     0
1  7/2/2016     Park     98      67           72      90.0   0.25     0
2  7/3/2016     Park    110      77           71     104.0   0.25     0

juice['Sold'] = juice['Lemon'] + juice['Orange']  # set Sold to the total units sold
print(juice.head(3))

       Date Location  Lemon  Orange  Temperature  Leaflets  Price  Sold
0  7/1/2016     Park     97      67           70      90.0   0.25   164
1  7/2/2016     Park     98      67           72      90.0   0.25   165
2  7/3/2016     Park    110      77           71     104.0   0.25   187

Revenue = price x units sold
Let's store it in a Revenue column.

juice['Revenue'] = juice['Sold'] * juice['Price']
print(juice.head(3))

       Date Location  Lemon  Orange  Temperature  Leaflets  Price  Sold  \
0  7/1/2016     Park     97      67           70      90.0   0.25   164   
1  7/2/2016     Park     98      67           72      90.0   0.25   165   
2  7/3/2016     Park    110      77           71     104.0   0.25   187   

   Revenue  
0    41.00  
1    41.25  
2    46.75

drop(axis = 0|1)
- axis means the axis along which the operation is applied.
- With axis set to 0, drop() removes rows (the label is looked up in the index).
- With axis set to 1, drop() removes columns.

juice_column_drop = juice.drop('Sold', axis = 1)
print(juice_column_drop.head(3))

       Date Location  Lemon  Orange  Temperature  Leaflets  Price  Revenue
0  7/1/2016     Park     97      67           70      90.0   0.25    41.00
1  7/2/2016     Park     98      67           72      90.0   0.25    41.25
2  7/3/2016     Park    110      77           71     104.0   0.25    46.75

juice_row_drop = juice.drop(0, axis = 0)
print(juice_row_drop.head(3))

       Date Location  Lemon  Orange  Temperature  Leaflets  Price  Sold  \
1  7/2/2016     Park     98      67           72      90.0   0.25   165   
2  7/3/2016     Park    110      77           71     104.0   0.25   187   
3  7/4/2016    Beach    134      99           76      98.0   0.25   233   

   Revenue  
1    41.25  
2    46.75  
3    58.25
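
The drop() calls above pick rows or columns through axis. pandas also accepts the `columns=` and `index=` keywords, which say the same thing more explicitly. A small sketch, assuming a stand-in frame named `sample` (a hypothetical name) built from the first three rows above:

```python
import pandas as pd

# Stand-in frame: the first three rows of the Lemon, Orange and Sold columns.
sample = pd.DataFrame({
    "Lemon": [97, 98, 110],
    "Orange": [67, 67, 77],
    "Sold": [164, 165, 187],
})

# axis=1 looks the label up among the column names.
by_axis_col = sample.drop("Sold", axis=1)
by_keyword_col = sample.drop(columns="Sold")

# axis=0 looks the label up in the index.
by_axis_row = sample.drop(0, axis=0)
by_keyword_row = sample.drop(index=0)

print(by_axis_col.equals(by_keyword_col))
print(by_axis_row.equals(by_keyword_row))
print(sample.shape)  # drop() returns a new frame; the original is unchanged
```

Because drop() returns a new DataFrame, the result has to be assigned to a variable, as the examples above do with juice_column_drop and juice_row_drop.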

Data indexing

juice[0:5]

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Revenue
0	7/1/2016	Park	97	67	70	90.0	0.25	164	41.00
1	7/2/2016	Park	98	67	72	90.0	0.25	165	41.25
2	7/3/2016	Park	110	77	71	104.0	0.25	187	46.75
3	7/4/2016	Beach	134	99	76	98.0	0.25	233	58.25
4	7/5/2016	Beach	159	118	78	135.0	0.25	277	69.25

Extracting data with boolean values

# rows where Location is Beach
# juice['Location'].value_counts()
juice[juice['Location'] == "Beach"]

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Revenue
3	7/4/2016	Beach	134	99	76	98.0	0.25	233	58.25
4	7/5/2016	Beach	159	118	78	135.0	0.25	277	69.25
5	7/6/2016	Beach	103	69	82	90.0	0.25	172	43.00
6	7/6/2016	Beach	103	69	82	90.0	0.25	172	43.00
7	7/7/2016	Beach	143	101	81	135.0	0.25	244	61.00
8	NaN	Beach	123	86	82	113.0	0.25	209	52.25
9	7/9/2016	Beach	134	95	80	126.0	0.25	229	57.25
10	7/10/2016	Beach	140	98	82	131.0	0.25	238	59.50
11	7/11/2016	Beach	162	120	83	135.0	0.25	282	70.50
12	7/12/2016	Beach	130	95	84	99.0	0.25	225	56.25
13	7/13/2016	Beach	109	75	77	99.0	0.25	184	46.00
14	7/14/2016	Beach	122	85	78	113.0	0.25	207	51.75
15	7/15/2016	Beach	98	62	75	108.0	0.50	160	80.00
16	7/16/2016	Beach	81	50	74	90.0	0.50	131	65.50
17	7/17/2016	Beach	115	76	77	126.0	0.50	191	95.50
30	7/30/2016	Beach	88	57	82	81.0	0.35	145	50.75
31	7/31/2016	Beach	76	47	82	68.0	0.35	123	43.05


# rows where Leaflets is 100 or more
# juice['Location'].value_counts()
juice[juice['Leaflets'] >= 100]

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Revenue
2	7/3/2016	Park	110	77	71	104.0	0.25	187	46.75
4	7/5/2016	Beach	159	118	78	135.0	0.25	277	69.25
7	7/7/2016	Beach	143	101	81	135.0	0.25	244	61.00
8	NaN	Beach	123	86	82	113.0	0.25	209	52.25
9	7/9/2016	Beach	134	95	80	126.0	0.25	229	57.25
10	7/10/2016	Beach	140	98	82	131.0	0.25	238	59.50
11	7/11/2016	Beach	162	120	83	135.0	0.25	282	70.50
14	7/14/2016	Beach	122	85	78	113.0	0.25	207	51.75
15	7/15/2016	Beach	98	62	75	108.0	0.50	160	80.00
17	7/17/2016	Beach	115	76	77	126.0	0.50	191	95.50
18	7/18/2016	Park	131	92	81	122.0	0.50	223	111.50
19	7/19/2016	Park	122	85	78	113.0	0.50	207	103.50
22	7/22/2016	Park	112	75	80	108.0	0.50	187	93.50
23	7/23/2016	Park	120	82	81	117.0	0.50	202	101.00
24	7/24/2016	Park	121	82	82	117.0	0.50	203	101.50
25	7/25/2016	Park	156	113	84	135.0	0.50	269	134.50
26	7/26/2016	Park	176	129	83	158.0	0.35	305	106.75
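
The two filters above each use a single condition. Conditions can also be combined with `&` (and), `|` (or) and `~` (not). A minimal sketch, assuming a stand-in frame named `sample` (a hypothetical name) built from rows 2, 3, 4 and 1 of the table:

```python
import pandas as pd

# Stand-in frame: four rows taken from the lemonade table above.
sample = pd.DataFrame({
    "Location": ["Park", "Beach", "Beach", "Park"],
    "Leaflets": [104.0, 98.0, 135.0, 90.0],
    "Price": [0.25, 0.25, 0.25, 0.25],
})

is_beach = sample["Location"] == "Beach"
many_leaflets = sample["Leaflets"] >= 100

both = sample[is_beach & many_leaflets]    # Beach AND at least 100 leaflets
either = sample[is_beach | many_leaflets]  # Beach OR at least 100 leaflets
not_beach = sample[~is_beach]              # everything except Beach

print(both)
print(either)
print(not_beach)
```

When the comparisons are written inline, each one needs its own parentheses, for example `sample[(sample["Location"] == "Beach") & (sample["Leaflets"] >= 100)]`, because `&` binds more tightly than `==` and `>=`.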

iloc vs loc

Let's check the difference!

juice.head(3)
# positional index of each column in the output below
#     0      1        2       3         4            5          6       7      8

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Revenue
0	7/1/2016	Park	97	67	70	90.0	0.25	164	41.00
1	7/2/2016	Park	98	67	72	90.0	0.25	165	41.25
2	7/3/2016	Park	110	77	71	104.0	0.25	187	46.75

%%time

Measures execution time.
Let's look at how efficient the code is.

%%time

juice.iloc[0:3, 0:2]  # position-based (integer index)

CPU times: user 2.14 ms, sys: 0 ns, total: 2.14 ms
Wall time: 3.19 ms

	Date	Location
0	7/1/2016	Park
1	7/2/2016	Park
2	7/3/2016	Park
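
%%time is an IPython/Jupyter cell magic, so it only works inside a notebook cell. In a plain .py script a rough equivalent can be sketched with time.perf_counter from the standard library. The frame `sample` below is a hypothetical stand-in built from the first three rows:

```python
import time

import pandas as pd

# Stand-in frame: the first three rows of the lemonade table.
sample = pd.DataFrame({
    "Date": ["7/1/2016", "7/2/2016", "7/3/2016"],
    "Location": ["Park", "Park", "Park"],
    "Lemon": [97, 98, 110],
})

start = time.perf_counter()              # high-resolution timer
subset = sample.iloc[0:3, 0:2]           # the operation being timed
elapsed = time.perf_counter() - start    # elapsed seconds

print(f"iloc took {elapsed * 1000:.3f} ms")
```

A single run on such a small frame is noisy, so the numbers are only a rough guide.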

loc
-> label-based

%%time

juice.loc[0:2, ['Date', 'Location']]  # label-based, so 0:2 includes label 2 and returns 3 rows

CPU times: user 1.64 ms, sys: 0 ns, total: 1.64 ms
Wall time: 1.62 ms

	Date	Location
0	7/1/2016	Park
1	7/2/2016	Park
2	7/3/2016	Park
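
With the default index 0, 1, 2, ... positions and labels are the same numbers, which hides part of the difference between iloc and loc. A small sketch with a deliberately different index makes it visible. The frame `sample` and its labels 10 to 13 are hypothetical:

```python
import pandas as pd

# Stand-in frame with index labels that differ from the positions.
sample = pd.DataFrame(
    {
        "Date": ["7/1/2016", "7/2/2016", "7/3/2016", "7/4/2016"],
        "Location": ["Park", "Park", "Park", "Beach"],
    },
    index=[10, 11, 12, 13],
)

by_position = sample.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = sample.loc[10:12]    # labels 10, 11 and 12; the end label is included

print(by_position)
print(by_label)
```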


# pattern: juice.loc[row condition, list of column names]

juice.loc[juice['Leaflets'] >= 100, ['Date', 'Location']]  # columns are selected by name

	Date	Location
2	7/3/2016	Park
4	7/5/2016	Beach
7	7/7/2016	Beach
8	NaN	Beach
9	7/9/2016	Beach
10	7/10/2016	Beach
11	7/11/2016	Beach
14	7/14/2016	Beach
15	7/15/2016	Beach
17	7/17/2016	Beach
18	7/18/2016	Park
19	7/19/2016	Park
22	7/22/2016	Park
23	7/23/2016	Park
24	7/24/2016	Park
25	7/25/2016	Park
26	7/26/2016	Park

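# iloc is position-based and raises an error for a boolean Series mask; convert the mask to a NumPy array first
juice.iloc[(juice['Leaflets'] >= 100).to_numpy(), 0:2]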

Sorting

sort_values()

# sort by revenue, highest first
juice.sort_values(by=['Revenue'], ascending=False).head(3)

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Revenue
25	7/25/2016	Park	156	113	84	135.0	0.50	269	134.50
18	7/18/2016	Park	131	92	81	122.0	0.50	223	111.50
26	7/26/2016	Park	176	129	83	158.0	0.35	305	106.75

# with two or more keys, rows are ordered by the first key and ties are ordered by the next key
juice.sort_values(by=['Price', 'Temperature'], ascending=False)

# these also work
#juice.sort_values(by=['Price', 'Temperature'], ascending=[False, True]) -> Price descending, Temperature ascending
#juice.sort_values(by=['Price', 'Temperature'], ascending=[False, True]).reset_index(drop=True)  -> also renumbers the index

	Date	Location	Lemon	Orange	Temperature	Leaflets	Price	Sold	Revenue
25	7/25/2016	Park	156	113	84	135.0	0.50	269	134.50
24	7/24/2016	Park	121	82	82	117.0	0.50	203	101.50
18	7/18/2016	Park	131	92	81	122.0	0.50	223	111.50
23	7/23/2016	Park	120	82	81	117.0	0.50	202	101.00
22	7/22/2016	Park	112	75	80	108.0	0.50	187	93.50
19	7/19/2016	Park	122	85	78	113.0	0.50	207	103.50
17	7/17/2016	Beach	115	76	77	126.0	0.50	191	95.50
21	7/21/2016	Park	83	50	77	90.0	0.50	133	66.50
15	7/15/2016	Beach	98	62	75	108.0	0.50	160	80.00
16	7/16/2016	Beach	81	50	74	90.0	0.50	131	65.50
20	7/20/2016	Park	71	42	70	NaN	0.50	113	56.50
26	7/26/2016	Park	176	129	83	158.0	0.35	305	106.75
28	7/28/2016	Park	96	63	82	90.0	0.35	159	55.65
30	7/30/2016	Beach	88	57	82	81.0	0.35	145	50.75
31	7/31/2016	Beach	76	47	82	68.0	0.35	123	43.05
29	7/29/2016	Park	100	66	81	95.0	0.35	166	58.10
27	7/27/2016	Park	104	68	80	99.0	0.35	172	60.20
12	7/12/2016	Beach	130	95	84	99.0	0.25	225	56.25
11	7/11/2016	Beach	162	120	83	135.0	0.25	282	70.50
5	7/6/2016	Beach	103	69	82	90.0	0.25	172	43.00
6	7/6/2016	Beach	103	69	82	90.0	0.25	172	43.00
8	NaN	Beach	123	86	82	113.0	0.25	209	52.25
10	7/10/2016	Beach	140	98	82	131.0	0.25	238	59.50
7	7/7/2016	Beach	143	101	81	135.0	0.25	244	61.00
9	7/9/2016	Beach	134	95	80	126.0	0.25	229	57.25
4	7/5/2016	Beach	159	118	78	135.0	0.25	277	69.25
14	7/14/2016	Beach	122	85	78	113.0	0.25	207	51.75
13	7/13/2016	Beach	109	75	77	99.0	0.25	184	46.00
3	7/4/2016	Beach	134	99	76	98.0	0.25	233	58.25
1	7/2/2016	Park	98	67	72	90.0	0.25	165	41.25
2	7/3/2016	Park	110	77	71	104.0	0.25	187	46.75
0	7/1/2016	Park	97	67	70	90.0	0.25	164	41.00
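
The commented-out variants above give each key its own sort direction and renumber the index. A runnable sketch of the same idea, assuming a stand-in frame named `sample` (a hypothetical name) built from rows 0, 18, 26 and 15 of the table:

```python
import pandas as pd

# Stand-in frame: Price and Temperature of four rows from the table above.
sample = pd.DataFrame({
    "Price": [0.25, 0.50, 0.35, 0.50],
    "Temperature": [70, 81, 83, 75],
})

# Price descending; within the same Price, Temperature ascending.
ordered = sample.sort_values(by=["Price", "Temperature"], ascending=[False, True])

# drop=True discards the old index instead of keeping it as a column.
renumbered = ordered.reset_index(drop=True)

print(ordered)
print(renumbered)
```

`ordered` keeps the original row labels, while `renumbered` starts again from 0.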

Group by

Comparable to group_by() %>% summarize() in R's dplyr
-> summarizes data -> like a pivot table in Excel

# count the non-null values in each column per Location
juice.groupby(by = 'Location').count()

	Date	Lemon	Orange	Temperature	Leaflets	Price	Sold	Revenue
Location
Beach	16	17	17	17	17	17	17	17
Park	15	15	15	15	14	15	15	15

# aggregate functions
import numpy as np

juice.groupby(['Location'])['Revenue'].agg([max, min, sum, np.mean])

	max	min	sum	mean
Location
Beach	95.5	43.0	1002.8	58.988235
Park	134.5	41.0	1178.2	78.546667

# aggregate functions on several columns
import numpy as np

# select several columns with a list (double brackets); the bare form ['Revenue', 'Lemon'] triggers a FutureWarning
juice.groupby(['Location'])[['Revenue', 'Lemon']].agg([max, min, sum, np.mean])

	Revenue				Lemon
	max	min	sum	mean	max	min	sum	mean
Location
Beach	95.5	43.0	1002.8	58.988235	162	76	2020	118.823529
Park	134.5	41.0	1178.2	78.546667	176	71	1697	113.133333
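
The aggregations above pass function objects such as max and np.mean. pandas also accepts the function names as strings, and newer pandas versions may emit a warning when the built-in functions are passed instead. A minimal sketch of the string form and of named aggregation, assuming a stand-in frame named `sample` and the output column names `max_revenue` and `total_lemon` (all hypothetical names):

```python
import pandas as pd

# Stand-in frame: rows 0, 1, 3 and 4 of the lemonade table.
sample = pd.DataFrame({
    "Location": ["Park", "Park", "Beach", "Beach"],
    "Lemon": [97, 98, 134, 159],
    "Revenue": [41.00, 41.25, 58.25, 69.25],
})

# Same idea as agg([max, min, sum, np.mean]), written with string names.
summary = sample.groupby("Location")["Revenue"].agg(["max", "min", "sum", "mean"])

# Named aggregation: each output column is given as (input column, function).
named = sample.groupby("Location").agg(
    max_revenue=("Revenue", "max"),
    total_lemon=("Lemon", "sum"),
)

print(summary)
print(named)
```

Named aggregation yields flat column names, which are easier to work with than the two-level header in the output above.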

Posted 2022-03-23 Updated 2022-03-23 minkuen python / library / numpy 10 minutes read (About 1566 words)

python_numpy_01

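Libraries

Let's try out several libraries.

Numpy

Q : what is numpy?

A : A library for array operations.

Q : why numpy?

A : Arithmetic between two lists fails with a TypeError, so NumPy is used instead.

# The following code raises a TypeError when run.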
A = [1, 2, 3]
B = [4, 5, 6]

A / B ** 2

# With numpy it works as expected.
import numpy as np

A = [1, 2, 3]
B = [4 ,5, 6]

np_A = np.array(A)
np_B = np.array(B)
np_A / np_B ** 2

array([0.0625    , 0.08      , 0.08333333])

Reshape

Usage example

(2,3) array -> np.reshape(3,2) -> (3,2) array

Usage example

Meaning of -1 in np.reshape(-1, 2)
: the number of columns is fixed at 2 and the number of rows is inferred automatically from the size of the array

import numpy as np
temp_arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16])
temp_arr

new_arr = temp_arr.reshape(2, -1)
new_arr

array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 9, 10, 11, 12, 13, 14, 15, 16]])

# let -1 infer the remaining dimension
new_arr = temp_arr.reshape(4, -1)
new_arr

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])
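
The explanation above describes reshape(-1, 2), while the examples put -1 in the second position. A short sketch of the first form on the same 16 values, including what happens when the size does not divide evenly. The variable names `two_cols` and `reshape_failed` are hypothetical:

```python
import numpy as np

temp_arr = np.arange(1, 17)  # the values 1 to 16, as in the example above

# Columns fixed at 2, so NumPy infers 16 / 2 = 8 rows.
two_cols = temp_arr.reshape(-1, 2)
print(two_cols.shape)

# 16 values cannot be split into rows of 5, so reshape raises ValueError.
reshape_failed = False
try:
    temp_arr.reshape(-1, 5)
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```

Only one dimension can be -1 in a reshape call, since NumPy can only infer one unknown size.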

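Advice

Machine learning / deep learning
- People who are good at math vs people who are new to math
- Machine learning / deep learning (formulas made by humans!)
- Ideally you understand the concept, understand it as a formula, and implement that formula in code
- Do you have to be a mathematician to use machine learning and deep learning!?
- The answer is no!
- The main purpose of machine learning / deep learning is to solve common problems in everyday life
- They are provided as frameworks (so understand the concepts!)
  - If you know the concepts, a model can be built just by setting the parameters well, often as plain strings
- You need to deliver results (developers need to be good at deployment!)
  - An image recognition model on its own has little use; just knowing how to ship it in an Android app or a web app already helps
  - Planning (which problem should we solve?)
- AutoML
  - Typing 4 to 5 lines of code produces a machine learning model!
  - But if you have a science or engineering background, work through the formulas yourself as well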

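How to install libraries (vs R)

# R: install.packages("package name")
# Python: libraries are not installed from inside the code (X)
# Install them from the terminal

# Method 1. Install with conda
# --> install Anaconda first, then install with conda (data science)
# conda manages the libraries (version updates are a little slow)

# Method 2. Install with pip
# --> no Anaconda / install only Python

# open git bash and run: pip install numpy
#pip install numpy
# Google Colab already comes with the basic environment set up.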

Importing the NumPy library

# import NumPy under the short alias np and print its version
import numpy as np
print(np.__version__)

1.21.5

Converting to an array

Create a list of the numbers 1 through 10.
Convert it to a NumPy array and store it.

temp = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
arr = np.array(temp)
print(arr)
print(temp)

[ 1  2  3  4  5  6  7  8  9 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(type(arr))
print(type(temp))

<class 'numpy.ndarray'>
<class 'list'>

Use NumPy's basic statistics functions.

print(np.mean(arr))
print(np.sum(arr))
print(np.median(arr))
print(np.std(arr))

5.5
55
5.5
2.8722813232690143
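
One detail worth knowing about the last value: np.std divides by n by default (population standard deviation), while the std column of pandas describe() divides by n - 1 (sample standard deviation). A small sketch of the difference, using the same 1 to 10 array. The variable names are hypothetical:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

population_std = np.std(arr)       # ddof=0: divide by n
sample_std = np.std(arr, ddof=1)   # ddof=1: divide by n - 1
pandas_std = pd.Series(arr).std()  # pandas uses ddof=1 by default

print(population_std)
print(sample_std)
print(pandas_std)
```

The two NumPy results differ, and the pandas result matches the ddof=1 version.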

Arithmetic operations

math_scores = [90,80, 88]
english_scores = [80, 70, 90]

total_scores = math_scores + english_scores
total_scores

[90, 80, 88, 80, 70, 90]

math_scores = [90,80, 88]
english_scores = [80, 70, 90]

math_arr = np.array(math_scores)
english_arr = np.array(english_scores)

total_scores = math_arr + english_arr
total_scores

array([170, 150, 178])

np.min(total_scores)

np.max(total_scores)

math_scores = [2, 3, 4]
english_scores = [1, 2, 3]

math_arr = np.array(math_scores)
english_arr = np.array(english_scores)

# arithmetic operations
print("Addition : ", np.add(math_arr, english_arr))
print("Subtraction : ", np.subtract(math_arr, english_arr))
print("Multiplication : ", np.multiply(math_arr, english_arr))
print("Division : ", np.divide(math_arr, english_arr))
print("Power : ", np.power(math_arr, english_arr))

Addition :  [3 5 7]
Subtraction :  [1 1 1]
Multiplication :  [ 2  6 12]
Division :  [2.         1.5        1.33333333]
Power :  [ 2  9 64]

Creating arrays

How to create arrays from 0 to 3 dimensions

temp_arr = np.array(20)
print(temp_arr)
print(type(temp_arr))
print(temp_arr.shape)

20
<class 'numpy.ndarray'>
()

# 1-D array
temp_arr = np.array([1,2,3])
print(temp_arr)
print(type(temp_arr))
print(temp_arr.shape) # shape of the array
print(temp_arr.ndim)  # ndim is the number of dimensions

[1 2 3]
<class 'numpy.ndarray'>
(3,)
1

# 2-D array
temp_arr = np.array([[1,2,3], [4, 5, 6]])
print(temp_arr)
print(type(temp_arr))
print(temp_arr.shape)
print(temp_arr.ndim)  # ndim is the number of dimensions

[[1 2 3]
 [4 5 6]]
<class 'numpy.ndarray'>
(2, 3)
2

# 3-D array
temp_arr = np.array([[[1,2,3], [4, 5, 6]], [[1,2,3], [4, 5, 6]]])
print(temp_arr)
print(type(temp_arr))
print(temp_arr.shape)
print(temp_arr.ndim)  # ndim is the number of dimensions

[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]
<class 'numpy.ndarray'>
(2, 2, 3)
3

temp_arr = np.array([1, 2, 3, 4], ndmin = 2) # ndmin sets the minimum number of dimensions
print(temp_arr)
print(type(temp_arr))
print(temp_arr.shape)
print(temp_arr.ndim)

[[1 2 3 4]]
<class 'numpy.ndarray'>
(1, 4)
2

Handling decimal places

temp_arr = np.trunc([-1.23, 1.23])
temp_arr

array([-1.,  1.])

temp_arr = np.fix([-1.23, 1.23])
temp_arr

array([-1.,  1.])

# Round to nearest
temp_arr = np.around([-1.63789, 1.23784], 4)  # round to 4 decimal places
temp_arr

array([-1.6379,  1.2378])

# Round down (floor)
temp_arr = np.floor([-1.63789, 1.23784])
temp_arr

array([-2.,  1.])

# Round up (ceil)
temp_arr = np.ceil([-1.63789, 1.23784])
temp_arr

array([-1.,  2.])
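
The four rounding functions differ mainly on negative numbers and on exact halves. A side-by-side sketch on one hypothetical input array, chosen here only to make those differences visible:

```python
import numpy as np

values = np.array([-1.5, -0.5, 0.5, 1.5, 2.5])

truncated = np.trunc(values)  # drops the fractional part (rounds toward zero)
floored = np.floor(values)    # largest integer <= value
ceiled = np.ceil(values)      # smallest integer >= value
rounded = np.around(values)   # nearest integer; exact halves go to the even neighbor

print(truncated)
print(floored)
print(ceiled)
print(rounded)
```

Note that `np.around` sends 0.5 to 0 and 2.5 to 2, which can be surprising if you expect halves to always round up.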

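Is shape ordered as height * rows * columns? Yes: shape lists the axes from outermost to innermost, so (2, 2, 3) means 2 blocks, each with 2 rows and 3 columns.
axis: selects which of those axes an operation runs along.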
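
To make the shape and axis notes concrete, here is a small sketch. The array and the choice of `sum` are assumptions for illustration; any reduction that takes `axis` behaves the same way.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)  # 2 blocks, 3 rows, 4 columns

sum_blocks = arr.sum(axis=0)  # collapses the block axis  -> shape (3, 4)
sum_rows = arr.sum(axis=1)    # collapses the row axis    -> shape (2, 4)
sum_cols = arr.sum(axis=2)    # collapses the column axis -> shape (2, 3)

print(arr.shape, sum_blocks.shape, sum_rows.shape, sum_cols.shape)
```

The axis passed to the function is the one that disappears from the result's shape.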

Various ways to work with arrays

# np.arange(5) -> creates an array of 5 values starting at 0
temp_arr = np.arange(5)
temp_arr

array([0, 1, 2, 3, 4])

# np.arange(1, 11, 3) -> values from 1 up to (but not including) 11, in steps of 3
temp_arr = np.arange(1, 11, 3)
temp_arr

array([ 1,  4,  7, 10])
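
The stop value of `np.arange` is never part of the result, which is easy to miss. A short sketch with a few hypothetical calls:

```python
import numpy as np

up = np.arange(1, 11, 3)     # 1, 4, 7, 10 (stops before 11)
exact = np.arange(1, 10, 3)  # 1, 4, 7 (10 is the stop value, so it is left out)
down = np.arange(10, 0, -2)  # 10, 8, 6, 4, 2 (a negative step counts down)

print(up, exact, down)
```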

# np.zeros -> create an array filled with 0
zero_arr = np.zeros((2,3))
print(zero_arr)
print(type(zero_arr))
print(zero_arr.shape)
print(zero_arr.ndim)
print(zero_arr.dtype) # dtype = data type

[[0. 0. 0.]
 [0. 0. 0.]]
<class 'numpy.ndarray'>
(2, 3)
2
float64

# np.ones -> create an array filled with 1
temp_arr = np.ones((4,5), dtype="int32")
print(temp_arr)
print(type(temp_arr))
print(temp_arr.shape)
print(temp_arr.ndim)
print(temp_arr.dtype)

[[1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]]
<class 'numpy.ndarray'>
(4, 5)
2
int32

temp_arr = np.ones((2,6), dtype="int32")  
print(temp_arr)
print(type(temp_arr))
print(temp_arr.shape)
print(temp_arr.ndim)
print(temp_arr.dtype)

[[1 1 1 1 1 1]
 [1 1 1 1 1 1]]
<class 'numpy.ndarray'>
(2, 6)
2
int32

# Change the shape of an array with reshape()
temp_arr = np.ones((12,12), dtype="int32")
temp_res_arr = temp_arr.reshape(4, -1)    # -1 lets NumPy infer that dimension (144 / 4 = 36)
print(temp_res_arr)
print(type(temp_res_arr))
print(temp_res_arr.shape)
print(temp_res_arr.ndim)
print(temp_res_arr.dtype)

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
<class 'numpy.ndarray'>
(4, 36)
2
int32
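
The `-1` only works when the total number of elements divides evenly by the other dimensions. A minimal sketch on a smaller, hypothetical array:

```python
import numpy as np

arr = np.arange(12)

auto_cols = arr.reshape(3, -1)  # 12 / 3 -> 4 columns are inferred
auto_rows = arr.reshape(-1, 6)  # 12 / 6 -> 2 rows are inferred

failed = False
try:
    arr.reshape(5, -1)          # 12 is not divisible by 5
except ValueError as err:
    failed = True
    print("reshape failed:", err)

print(auto_cols.shape, auto_rows.shape)
```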

NumPy conditional expressions

How to use where(a, b, c)
- where condition a is True the result takes the value from b, otherwise from c

temp_arr = np.arange(10)
temp_arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Values less than 5 keep their original value
# Values of 5 or more are multiplied by 10
np.where(temp_arr <5, temp_arr, temp_arr * 10)

# where(a, b, c): where condition a is True take b, otherwise take c

array([ 0,  1,  2,  3,  4, 50, 60, 70, 80, 90])

# Build an array from 0 to 100; multiply values less than 50 by 10 and return the rest unchanged
temp_arr = np.arange(101)
temp_arr

np.where(temp_arr <50, temp_arr * 10, temp_arr)

array([  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120,
       130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250,
       260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380,
       390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100])
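
One way to read `np.where(a, b, c)` is as an element-wise version of Python's `b if a else c`. The sketch below checks that reading against a plain list comprehension; the comparison loop is only there for illustration, not something the original code uses.

```python
import numpy as np

temp_arr = np.arange(10)

# Vectorized form, as used above.
vectorized = np.where(temp_arr < 5, temp_arr, temp_arr * 10)

# The same rule written as an explicit per-element choice.
looped = np.array([x if x < 5 else x * 10 for x in temp_arr])

# With only the condition, np.where returns the indices where it is True.
indices = np.where(temp_arr < 5)[0]

print(vectorized)
print(indices)
```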

When you need two conditions

np.select
See the following code for how to use it.

temp_arr = np.arange(10)
temp_arr

# Values greater than 5 are multiplied by 2; values less than 2 get 100 added
condlist = [temp_arr > 5, temp_arr <2]    # conditions
choicelist = [temp_arr *2, temp_arr + 100] # value used where the condition at the same position is True
np.select(condlist, choicelist, default = temp_arr)

array([100, 101,   2,   3,   4,   5,  12,  14,  16,  18])
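
In the example above the two conditions never overlap. When they do, `np.select` uses the first condition in the list that is True for each element. A sketch with deliberately overlapping, hypothetical conditions:

```python
import numpy as np

temp_arr = np.arange(10)

# 8 and 9 satisfy both conditions; the earlier entry in the list wins.
condlist = [temp_arr > 5, temp_arr > 7]
choicelist = [temp_arr * 2, temp_arr + 100]
result = np.select(condlist, choicelist, default=temp_arr)

# Reversing the order changes which rule applies to 8 and 9.
swapped = np.select(condlist[::-1], choicelist[::-1], default=temp_arr)

print(result)
print(swapped)
```

Putting the most specific condition first is usually the safer ordering.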

Broadcasting

Broadcasting is what you need to keep in mind when computing with arrays of different sizes.
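
A minimal sketch of the idea, with arrays made up for illustration: NumPy compares shapes from the last axis backwards, and two axes are compatible when they are equal or one of them is 1.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)  # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
column = np.array([[100], [200]])    # shape (2, 1)

plus_scalar = matrix + 1       # the scalar is applied to every element
plus_row = matrix + row        # row is repeated for each of the 2 rows
plus_column = matrix + column  # column is repeated across the 3 columns
grid = row + column            # (3,) and (2, 1) broadcast to (2, 3)

incompatible = False
try:
    matrix + np.array([1, 2])  # last axes are 3 and 2: not compatible
except ValueError as err:
    incompatible = True
    print("broadcast failed:", err)

print(plus_row)
print(plus_column)
print(grid)
```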

Tip

