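Computing bi-grams and conditional probabilities for "I Am a Cat" (吾輩は猫である)

This is an extra assignment from class. I'm frustrated with how messy the code turned out, so I'll clean it up later.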

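Problem

For the text of Natsume Soseki's novel "I Am a Cat" (neko.txt), compute and output all word bi-grams and their conditional probabilities, without smoothing. Differences in surface form and part of speech may be ignored: words with the same base form may be treated as the same word. The sentence-start symbol BOS and the sentence-end symbol EOS must be taken into account. The output does not need to be sorted by probability.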

Keywords

bi-gram
conditional probability
smoothing

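Approach

Computing a conditional probability requires two counts over the whole document: how often each word occurs, and how often each bi-gram occurs. The probability P(post | pre) is the bi-gram count divided by the count of the preceding word pre, so it is enough to compute these values in turn.

1. Load the morphological analysis results
2. Build a word list for each sentence
3. Build the list of bi-grams
4. Count the occurrences of each word
5. Count the occurrences of each bi-gram
6. Compute the conditional probabilities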
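
Written as a formula, the value computed in step 6 is the plain relative-frequency estimate with no smoothing. The notation is introduced here for illustration and is not part of the assignment: c(·) is an occurrence count over the whole document, and w_{i-1}, w_i are two consecutive words (with BOS and EOS counted as words).

```latex
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1},\, w_i)}{c(w_{i-1})}
```

Because no smoothing is applied, any bi-gram that never appears in the text simply has no entry rather than a small non-zero probability.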

1. Load the morphological analysis results

# Import MeCab (the tagger is not used below, because the file is already analyzed)
import MeCab
m = MeCab.Tagger('-Ochasen')

# File name
filename = 'neko.txt.mecab'

# Read the file
with open(filename,mode='rt',encoding='utf-8') as f:
    blockList = f.read().split('EOS\n') # split into sentences
# Drop empty blocks
blockList = list(filter(lambda x: x!='', blockList))

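Up to here it's the usual boilerplate. This time I used "neko.txt.mecab", the result of running MeCab on the text beforehand.
Since the bi-grams are built sentence by sentence, each block corresponds to one sentence.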
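
To make the splitting step concrete, here is a minimal sketch on an in-memory string instead of the real file. The sample text is a hypothetical fragment written in MeCab's default output layout (surface, tab, comma-separated features, with an "EOS" line after each sentence); it is not copied from neko.txt.mecab.

```python
# Hypothetical sample in MeCab's default output layout.
# The second "EOS" line stands for an empty line in the original text.
sample = (
    "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "EOS\n"
    "EOS\n"
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "EOS\n"
)

# Split on the sentence terminator, then drop the empty blocks
# left behind by consecutive "EOS" lines and by the final "EOS".
blocks = [b for b in sample.split("EOS\n") if b != ""]

print(len(blocks))  # number of non-empty sentences
```

The list comprehension does the same job as the filter(lambda ...) call above; which one to use is a matter of taste.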

2. Build a word list for each sentence

# Build the word list for each sentence
res = []
for block in blockList:
    wordList = ["BOS"]
    for line in block.split("\n"):
        if line == "":
            continue
        base = line.split("\t")[1].split(",")[6]
        if base == "\u3000":
            continue
        wordList.append(base)
    wordList.append("EOS")
    res.append(wordList)

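Loop over the sentences and append each word to a list. "BOS" and "EOS", marking the start and end of the sentence, are added at the beginning and end. Full-width spaces would get in the way of the later calculations, so "\u3000" is dropped, and the empty lines produced by the split are skipped as well.

Only the base form from the morphological analysis is needed here, so only base is used. Because the input is a MeCab output file, each line is split on the tab into surface and feature, and base is taken from the latter.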
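
As a small illustration of the surface/feature split, the sketch below parses a single line. The line is a hypothetical example that assumes the IPADIC feature layout, in which the seventh comma-separated field is the base form; other dictionaries may order their fields differently.

```python
# Hypothetical single line of MeCab output (IPADIC feature layout assumed):
# surface<TAB>POS,POS1,POS2,POS3,conjugation type,conjugation form,base,reading,pronunciation
line = "あっ\t動詞,自立,*,*,五段・ラ行,連用タ接続,ある,アッ,アッ"

# Split into the surface form and the feature string
surface, feature = line.split("\t")

# Index 6 (the seventh field) holds the base form
base = feature.split(",")[6]

print(surface, base)
```

Here the surface form あっ maps to the base form ある, which is why inflected forms end up counted as the same word.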

3. Build the list of bi-grams

# Build the bi-grams
def bigram(target):
    bigram = []
    for t in target:
        for i in range(len(t)-1):
            bigram.append([t[i],t[i+1]])
    return bigram

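# Note: this rebinds the name bigram from the function to its result,
# so the function can only be called once per run
bigram = bigram(res)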

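I agonized over the variable and function names, but settled for "as long as it runs" this time. I made it a function because other assignments also needed bi-grams. It only handles bi-grams, not tri-grams and so on, so it takes a single list as its argument.

With the later processing in mind, each bi-gram is stored as a [pre, post] list, so the result looks like [[pre1,post1],[pre2,post2],......,[pren,postn]].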
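
For comparison, adjacent pairs can also be produced with zip instead of an index loop. This is only a sketch of an alternative, not the code used in this post; the function name sentence_bigrams is made up for the example, and it returns tuples rather than [pre, post] lists.

```python
def sentence_bigrams(words):
    """Return the adjacent (pre, post) pairs of one sentence's word list."""
    # zip stops at the shorter input, so the last word is never a "pre"
    return list(zip(words, words[1:]))

sentence = ["BOS", "吾輩", "は", "猫", "で", "ある", "。", "EOS"]
pairs = sentence_bigrams(sentence)

print(pairs[0])    # first pair starts at BOS
print(len(pairs))  # one fewer than the number of words
```

A sentence of n words (including BOS and EOS) always yields n - 1 pairs, which matches range(len(t)-1) in the loop version.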

4. Count the occurrences of each word

def count_word(target):
    wcount = {}
    for t in target:
        for w in t:
            if w in wcount:
                value = wcount[w]
                wcount[w] = value + 1
            else:
                wcount[w] = 1
    return wcount

count_w = count_word(res)

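This computes how many times each word occurs in the whole document. Create a dict up front; if the word is already in the dict, add 1 to its value, otherwise add it with a value of 1. This pattern has come up so many times that I've finally memorized it.

5. Count the occurrences of each bi-gram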
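
The same "look up, then add one" pattern is available in the standard library as collections.Counter. The sketch below shows it on two hypothetical, already tokenized sentences; it is an alternative way to get the counts, not the code used above.

```python
from collections import Counter

# Two hypothetical sentences, already reduced to base forms
sentences = [
    ["BOS", "吾輩", "は", "猫", "で", "ある", "。", "EOS"],
    ["BOS", "名前", "は", "まだ", "無い", "。", "EOS"],
]

# Flatten the sentences and count every word in one pass
word_counts = Counter(w for s in sentences for w in s)

print(word_counts["は"])
print(word_counts["猫"])
```

A Counter behaves like a dict, so word_counts[w] can be used exactly where count_w[w] is used later. One difference: looking up a missing word returns 0 instead of raising KeyError.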

def count_bigram(target):
    bigram_count = {}
    for bi in target:
        key = bi[1]+" | "+bi[0]
        if key in bigram_count:
            value = bigram_count[key]
            bigram_count[key] = value + 1
        else:
            bigram_count[key] = 1
    return bigram_count

count_bi = count_bigram(bigram)

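Next, count the occurrences of each bi-gram. The list of bi-grams already exists, so it's just a matter of walking through it. The basic idea is the same as above. I chose to store each key in conditional-probability notation, "post | pre".

For example, for 「吾輩は猫である。」 the keys look like this:
(吾輩 | BOS) (は | 吾輩) (猫 | は) (で | 猫) (ある | で) (。 | ある) (EOS | 。)

The reason is simply that I couldn't work out how to put more than one variable into a dict key, and I didn't know how to do more complex list handling, so I joined them into a single string key.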
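
On the question of multi-part keys: a dict key must be hashable, so a list such as [pre, post] is rejected, but a tuple (pre, post) works. A minimal sketch, using hypothetical pairs rather than the real data:

```python
from collections import Counter

# Hypothetical bi-grams in the same [pre, post] list form used above
bigram_lists = [["BOS", "吾輩"], ["吾輩", "は"], ["は", "猫"], ["吾輩", "は"]]

# Lists are unhashable, so convert each pair to a tuple before counting
pair_counts = Counter(tuple(p) for p in bigram_lists)

print(pair_counts[("吾輩", "は")])  # 2
```

With tuple keys, pre and post stay separate values, so there is no need to build or parse a "post | pre" string.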

6. Compute the conditional probabilities

def calc_prob(bigram,count_w,count_bi):
    ans = {}
    for bi in bigram:
        key = bi[1]+" | "+bi[0]
        if key in ans:
            continue
        pre = bi[0]
        prb = count_bi[key]/count_w[pre]
        ans[key] = prb
    return ans

ans = calc_prob(bigram,count_w,count_bi)

# Print some answers, e.g. P(は | 吾輩), P(で | 猫), P(吾輩 | BOS)
print("P(は|吾輩) = ",ans["は | 吾輩"])
print("P(で|猫) = ",ans["で | 猫"])
print("P(吾輩|BOS) = ",ans["吾輩 | BOS"])

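Finally, compute the conditional probabilities.
The word counts and bi-gram counts are already available, so either one can drive the loop. Since the bi-gram list from step 3 stores its items as [pre, post], I decided to walk through that.

First, build the key from [pre, post] so that the bi-gram count from step 5 can be looked up. The same key is used in the returned dict. Then compute the probability, assign it to the key, and return the result.

Checking every value wouldn't tell me much, so I printed three of them from the opening sentence 「吾輩は猫である。」:
P(は|吾輩), P(で|猫), and P(吾輩|BOS).

Output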

P(は|吾輩) = 0.3887733887733888
P(で|猫) = 0.036290322580645164
P(吾輩|BOS) = 0.020304017372421282
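
One way to sanity-check numbers like these is to confirm that, for each preceding word, the probabilities of all following words add up to 1. This holds because every word except EOS is followed by exactly one word in its sentence. The sketch below runs that check on two hypothetical sentences, with tuple keys and Counter standing in for the functions above.

```python
import math
from collections import Counter, defaultdict

# Two hypothetical sentences, already reduced to base forms
sentences = [
    ["BOS", "吾輩", "は", "猫", "で", "ある", "。", "EOS"],
    ["BOS", "名前", "は", "まだ", "無い", "。", "EOS"],
]

word_counts = Counter(w for s in sentences for w in s)
pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))

# P(post | pre) = count(pre, post) / count(pre)
probs = {(pre, post): n / word_counts[pre]
         for (pre, post), n in pair_counts.items()}

# Sum the probabilities over all "post" words for each "pre"
totals = defaultdict(float)
for (pre, post), p in probs.items():
    totals[pre] += p

all_ok = all(math.isclose(t, 1.0) for t in totals.values())
print(all_ok)
```

If the check fails on real data, a likely cause is a mismatch between how words and bi-grams were counted, for example counting words from a different list than the one the bi-grams came from.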

Source code

# Import MeCab (the tagger is not used below, because the file is already analyzed)
import MeCab
m = MeCab.Tagger('-Ochasen')

# Build the bi-grams
def bigram(target):
    bigram = []
    for t in target:
        for i in range(len(t)-1):
            bigram.append([t[i],t[i+1]])
    return bigram

# Count the occurrences of each word
def count_word(target):
    wcount = {}
    for r in target:  # iterate over the argument, not the global res
        for w in r:
            if w in wcount:
                value = wcount[w]
                wcount[w] = value + 1
            else:
                wcount[w] = 1
    return wcount

# Count the occurrences of each bi-gram
def count_bigram(target):
    bigram_count = {}
    for bi in target:
        key = bi[1]+" | "+bi[0]
        if key in bigram_count:
            value = bigram_count[key]
            bigram_count[key] = value + 1
        else:
            bigram_count[key] = 1
    return bigram_count

# Compute the conditional probabilities
def calc_prob(bigram,count_w,count_bi):
    ans = {}
    for bi in bigram:
        key = bi[1]+" | "+bi[0]
        if key in ans:
            continue
        pre = bi[0]
        prb = count_bi[key]/count_w[pre]
        ans[key] = prb
    return ans

# File name
filename = 'neko.txt.mecab'

# Read the file
with open(filename,mode='rt',encoding='utf-8') as f:
    blockList = f.read().split('EOS\n') # split into sentences

# Drop empty blocks
blockList = list(filter(lambda x: x!='', blockList))

# Build the word list for each sentence
res = []
for block in blockList:
    wordList = ["BOS"]
    for line in block.split("\n"):
        if line == "":
            continue
        base = line.split("\t")[1].split(",")[6]
        if base == "\u3000":
            continue
        wordList.append(base)
    wordList.append("EOS")
    res.append(wordList)


# Build the bi-grams
bigram = bigram(res)

# Count the occurrences of each word
count_w = count_word(res)

# Count the occurrences of each bi-gram
count_bi = count_bigram(bigram)

# Compute the conditional probabilities
ans = calc_prob(bigram,count_w,count_bi)

# Print some answers, e.g. P(は | 吾輩), P(で | 猫), P(吾輩 | BOS)
print("P(は|吾輩) = ",ans["は | 吾輩"])
print("P(で|猫) = ",ans["で | 猫"])
print("P(吾輩|BOS) = ",ans["吾輩 | BOS"])

SHOT4

Study notes of a working graduate student