
[DA][Python] (Second Design, Completed) Redesigning the KakaoTalk Chat Data Preprocessing Code

by Sun A 2024. 7. 8.

This post describes the final code, which was completed after receiving feedback.


Changes

1. Renamed the functions so their purpose is clear

2. Redesigned the code into four functions in total: two handlers for the two differently formatted line types in the raw data, a time-format conversion function, and a data preprocessing function

 

The code from the first design that distinguishes the line formats and preprocesses the data is carried over as is. What I learned this time was how to structure the functions, how to separate the content, and how to load the data into a data frame and loop over it by selecting the column that holds each line.

 

Summary of the functions

1. extract_dates_inv

  • Extracts date information
  • Extracts the year, month, day, and current_weekday values
    • Extracts and cleans the data
      • e.g. 2023년 => 2023

2. extract_conversation

  • Extracts information from a message line
  • Extracts the name, time, and message values
    • Removes the square brackets present in the raw data

3. convert_24hr

  • Converts a 12-hour time to 24-hour format
  • Raw value: 오후 1:30 (1:30 PM)
  • Converted value: 13:30

4. generate_dataframe

  • Processes the data
  • Loops over the lines with a for statement and applies the functions above to each line that matches a condition.

 

Loading the file

Importing libraries

import pandas as pd
import warnings

 

Suppressing the warning

warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)
  • This suppresses the warning that, when part of a DataFrame is modified, the change may not be safely applied to the original DataFrame.
  • Using the loc indexer is another option.

More on SettingWithCopyWarning


Example code

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df_slice = df[df['A'] > 1]
df_slice['B'] = 0

In the code above, at df_slice['B'] = 0, pandas cannot be sure whether df_slice is a view or a copy of the original df.

- If it is a view of df: changes to df_slice may also affect df.

- If it is a copy of df: the original df is not changed.

- To handle this without the ignore filter, use loc:

df.loc[df['A'] > 1, 'B'] = 0
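
As a small sketch of the two ways to avoid the ambiguity described above, the snippet below either takes an explicit copy of the slice or edits the frame with a single loc assignment. The variable name df_updated is only illustrative and does not come from the original code.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Option 1: take an explicit copy, so edits clearly stay in the new object.
df_slice = df[df["A"] > 1].copy()
df_slice["B"] = 0

# Option 2: modify a frame in place with a single .loc assignment.
# (df_updated is a copy here only so that df stays intact for the comparison.)
df_updated = df.copy()
df_updated.loc[df_updated["A"] > 1, "B"] = 0

print(df["B"].tolist())          # original untouched: [4, 5, 6]
print(df_slice["B"].tolist())    # [0, 0]
print(df_updated["B"].tolist())  # [4, 0, 0]
```

With either form the intent is explicit, so the warning filter is not needed for this pattern.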

 

Reading the data frame

data = pd.read_table(r"insert your file path here")
  • Copy the path of the file you want to load into a data frame and paste it inside the parentheses.

 

Renaming the column

The content to analyze is written line by line.

Because the index holds the rows and the single column holds the content to preprocess, the column of data is renamed to 'text'.

data.columns = ['text']
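
One thing to keep in mind: by default read_table treats the first line of the file as the header and splits each line on tab characters, so the first line becomes a column name rather than a row. The following is a minimal alternative sketch, not the code used in this post, that reads the file with plain Python and builds the one-column frame directly. The helper name load_lines_as_frame and the temporary sample file are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def load_lines_as_frame(path, encoding="utf-8"):
    """Read a text file line by line into a one-column DataFrame named 'text'."""
    with open(path, encoding=encoding) as f:
        # Keep every non-empty line as one row; no header inference, no tab splitting.
        lines = [line.rstrip("\n") for line in f if line.strip()]
    return pd.DataFrame({"text": lines})


# Small demonstration with a temporary file standing in for a real export.
sample = (
    "--------------- 2023년 3월 24일 금요일 ---------------\n"
    "[아빵] [오후 2:16] 삶의 교훈\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

data = load_lines_as_frame(tmp.name)
os.remove(tmp.name)
print(data.shape)  # (2, 1)
```

The resulting frame already has a column named 'text', so the rest of the processing can stay the same.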

 

Date extraction function

Function name: extract_dates_inv

Raw data

--------------- 2023년 3월 24일 금요일 ---------------

Steps needed for extraction

  • Strip the '-' characters, which carry no data, and the surrounding spaces, then split on spaces
  • Remove the trailing '년', '월', and '일' characters from the date tokens
  • Store the weekday

def extract_dates_inv(line):
    # Strip the hyphens and spaces at both ends, then split on spaces.
    date = line.strip('- ').split(' ')

    # Drop the trailing unit character ('년', '월', '일') from each token.
    year = date[0][:-1]
    month = date[1][:-1]
    day = date[2][:-1]

    # The fourth token is the weekday, e.g. '금요일'.
    current_weekday = date[3]

    return year, month, day, current_weekday
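
For comparison, the same extraction could also be written with a regular expression, which additionally checks that the line really has the date layout. This is only a sketch under the assumption that every date line looks like the sample above; the names DATE_LINE and parse_date_line are hypothetical and are not part of the original code.

```python
import re

# Hypothetical pattern for a date separator line such as
# '--------------- 2023년 3월 24일 금요일 ---------------'.
DATE_LINE = re.compile(r"^-+\s*(\d{4})년 (\d{1,2})월 (\d{1,2})일 (\S+)\s*-+$")


def parse_date_line(line):
    """Return (year, month, day, weekday) as strings, or None if the line does not match."""
    match = DATE_LINE.match(line.strip())
    if match is None:
        return None
    return match.groups()


print(parse_date_line("--------------- 2023년 3월 24일 금요일 ---------------"))
# ('2023', '3', '24', '금요일')
print(parse_date_line("----- not a date line -----"))
# None
```

The slicing version is shorter; the regex version trades that for an explicit None when a hyphen-prefixed line is not actually a date.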

 

Message extraction

Function name: extract_conversation

Raw data

[아빵] [오후 2:16] 삶의 교훈

Steps needed for extraction

  • Remove the square brackets from the data
  • Split the line into name, time, and message variables
  • Use a try/except block that passes when an error occurs

Why try/except is used


Suppose the function is written and run without it:

def extract_conversation(line):
    sender, rest = line.split('] [', 1)
    name = sender[1:]  # remove the leading '['
    time, message = rest.split('] ', 1)

    return name, time, message

ValueError: not enough values to unpack (expected 2, got 1)

An error like the one above is raised.

It occurs when the number of variables does not match the number of values actually provided.

Some lines in the raw data follow the expected format, but a long multi-line message continues on lines that contain only text, without the [name] [time] prefix.

A try/except statement is used to skip all such lines.

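def extract_conversation(line):
    try:
        # '[name] [time] message' -> '[name' and 'time] message'
        sender, rest = line.split('] [', 1)
        name = sender[1:]  # remove the leading '['
        time, message = rest.split('] ', 1)

        return name, time, message

    except ValueError:
        # The line does not match the expected format, so the function returns None.
        pass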
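
To make the effect of the try/except concrete, here is a short usage sketch. The function is repeated in condensed form so that the snippet runs on its own, with return None written out explicitly (equivalent to pass at the end of the function); the second sample line is invented for illustration.

```python
def extract_conversation(line):
    """Split '[name] [time] message' into its parts; return None on a format mismatch."""
    try:
        sender, rest = line.split("] [", 1)
        name = sender[1:]  # drop the leading '['
        time, message = rest.split("] ", 1)
        return name, time, message
    except ValueError:
        return None


print(extract_conversation("[아빵] [오후 2:16] 삶의 교훈"))
# ('아빵', '오후 2:16', '삶의 교훈')
print(extract_conversation("second line of a long message"))
# None
```

Because a mismatch yields None, the caller can decide with a simple truthiness check whether to keep the line.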

 

Time format conversion function

Function name: convert_24hr

The raw data shows times in 12-hour format, marked with 오전 (AM) or 오후 (PM).

This function converts them to 24-hour format.

 

The argument passed to the function will be a value such as '오전 12:00'.

So the code first needs to remove the 오전 / 오후 marker, then turn 12 AM into hour 00 and add 12 to every PM hour except 12 PM, which stays 12.

An if statement separates the AM and PM cases, and splitting the value on ':' gives hours and minutes, so 12 can be added to hours. The implementation is as follows.

# Convert a time to 24-hour format
def convert_24hr(time_str):
    if '오전' in time_str:
        time = time_str.replace('오전 ', '')
        hours, minutes = map(int, time.split(':'))
        if hours == 12:  # 12 AM becomes hour 00
            hours = 0
    elif '오후' in time_str:
        time = time_str.replace('오후 ', '')
        hours, minutes = map(int, time.split(':'))
        if hours != 12:  # 12 PM stays as is; add 12 to the other hours
            hours += 12
    else:
        # Neither marker is present, so hours and minutes would be undefined.
        raise ValueError(f"unexpected time format: {time_str!r}")
    return f"{hours:02}:{minutes:02}"
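
The same rule can be expressed with a little arithmetic: taking the hour modulo 12 maps 12 to 0, and PM then adds 12. The sketch below is an alternative formulation, not the function used in this post; the name convert_24hr_compact is hypothetical, and it assumes the marker and the clock are separated by a single space.

```python
def convert_24hr_compact(time_str):
    """Convert '오전 12:05' / '오후 1:30' style strings to 'HH:MM' (24-hour)."""
    meridiem, clock = time_str.split(" ", 1)
    if meridiem not in ("오전", "오후"):
        raise ValueError(f"unexpected time format: {time_str!r}")
    hours, minutes = map(int, clock.split(":"))
    # 12 o'clock wraps to 0, then PM adds 12: 오전 12 -> 0, 오후 12 -> 12, 오후 1 -> 13.
    hours = hours % 12 + (12 if meridiem == "오후" else 0)
    return f"{hours:02}:{minutes:02}"


for raw in ["오전 12:00", "오전 9:05", "오후 12:00", "오후 1:30"]:
    print(raw, "->", convert_24hr_compact(raw))
# 오전 12:00 -> 00:00
# 오전 9:05 -> 09:05
# 오후 12:00 -> 12:00
# 오후 1:30 -> 13:30
```

The four sample values cover the two edge cases (midnight and noon) along with one ordinary AM and PM time.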

 

Final data processing

The functions written above can now be used to process the data.

Function name: generate_dataframe

Create a list named dataframe and initialize the date variables, year through current_weekday, to None.

dataframe = []
year, month, day, current_weekday = None, None, None, None

 

Next, iterate over the rows. In each iteration, index is the index of the current row and rows is its data, so the 'text' column is selected from rows and the surrounding whitespace is stripped.

Each stripped line is stored in the variable line.

for index, rows in data.iterrows():
    line = rows['text'].strip()

 

Conditions on line handle the two data formats.

First, to handle date data, find the lines that start with hyphens.

if line.startswith('---------------'):
    year, month, day, current_weekday = extract_dates_inv(line)
    month = str(month).zfill(2)
    day = str(day).zfill(2)

The extract_dates_inv function written earlier cleans the date line and returns year, month, day, and current_weekday.

Month and day are then zero-padded to two digits with .zfill(2).

 

To handle message data, find the lines that start with a square bracket.

The extract_conversation function is called and everything it returns is stored in result.

If the stored value is not in the correct format, it must not be added to the data.

So an if statement appends the row to dataframe when the name, time, and message format matches, and otherwise moves on to the next line.

elif line.startswith('['):
    result = extract_conversation(line)
    if result:
        name, time, message = result
        time_24hr = convert_24hr(time)
        dataframe.append([year, month, day, current_weekday, time_24hr, name, message])
    else:
        continue
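
The fragments above can be put together roughly as follows. This is a self-contained sketch rather than the exact final code: the helpers are repeated in condensed form, the function name build_rows and the sample lines are hypothetical, and it takes any iterable of strings (for example data['text']) instead of calling iterrows.

```python
def extract_dates_inv(line):
    date = line.strip("- ").split(" ")
    return date[0][:-1], date[1][:-1], date[2][:-1], date[3]


def extract_conversation(line):
    try:
        sender, rest = line.split("] [", 1)
        time, message = rest.split("] ", 1)
        return sender[1:], time, message
    except ValueError:
        return None


def convert_24hr(time_str):
    meridiem, clock = time_str.split(" ", 1)
    hours, minutes = map(int, clock.split(":"))
    hours = hours % 12 + (12 if meridiem == "오후" else 0)
    return f"{hours:02}:{minutes:02}"


def build_rows(lines):
    """Turn raw chat lines into [year, month, day, weekday, time, name, message] rows."""
    rows = []
    year = month = day = current_weekday = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("---------------"):
            # Date separator: remember the date for the messages that follow.
            year, month, day, current_weekday = extract_dates_inv(line)
            month, day = month.zfill(2), day.zfill(2)
        elif line.startswith("["):
            result = extract_conversation(line)
            if result:
                name, time, message = result
                rows.append([year, month, day, current_weekday,
                             convert_24hr(time), name, message])
        # Any other line (for example the continuation of a long message) is skipped.
    return rows


sample = [
    "--------------- 2023년 3월 24일 금요일 ---------------",
    "[아빵] [오후 2:16] 삶의 교훈",
    "a continuation line without the [name] [time] prefix",
    "[아빵] [오전 12:05] hello",
]
rows = build_rows(sample)
for row in rows:
    print(row)
```

The returned list can then be passed to pd.DataFrame together with a list of column names of your choice to obtain the final table.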

 

Final output

(The message values are hidden because they contain the conversation itself.)

The result is one row per message, holding the date, the weekday, the 24-hour time, the name, and the message.