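Keras for R: predicting positive/negative sentiment of IMDB reviews (without LSTM)
Online review analysis
Overview
This post is a lightly modified version of the article at https://keras.rstudio.com/articles/tutorial_basic_text_classification.html, adapted for use in class. It builds a deep learning model that predicts whether an IMDB movie review is positive (1) or negative (0).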
Preparing the data
Load the required libraries, then fetch the IMDB dataset. Only the 10,000 most frequent words are kept (num_words=10000).
library(keras)
library(dplyr)
library(ggplot2)
library(purrr)
imdb<-dataset_imdb(num_words=10000)
Split the data into training and test sets.
c(train_data,train_labels)%<-%imdb$train
c(test_data,test_labels)%<-%imdb$test
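Every word has been replaced by an integer. To turn reviews back into text, first build a lookup table from integers to words. The indices used in the reviews are offset by 3 relative to the word index, because 0 through 3 are reserved for special tokens.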
word_index<-dataset_imdb_word_index()
word_index_df<-data.frame(
  word=names(word_index),
  idx=unlist(word_index,use.names=FALSE),
  stringsAsFactors=FALSE)
# Shift every index by 3, then add the reserved special tokens at 0-3
word_index_df<-word_index_df%>%
  mutate(idx=idx+3)%>%
  add_row(word="<PAD>",idx=0)%>%
  add_row(word="<START>",idx=1)%>%
  add_row(word="<UNK>",idx=2)%>%
  add_row(word="<UNUSED>",idx=3)%>%
  arrange(idx)
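The effect of the offset is easiest to see on a toy vocabulary. The following is a minimal sketch in base R; the three-word `toy_index` list is made up for illustration and only stands in for the real result of dataset_imdb_word_index().

```r
# Hypothetical miniature word index (rank by frequency), standing in for
# the real dataset_imdb_word_index() result.
toy_index <- list(the = 1, and = 2, movie = 3)

# Shift ranks by 3 so that 0-3 stay free for the special tokens.
toy_df <- data.frame(
  word = names(toy_index),
  idx = unlist(toy_index, use.names = FALSE) + 3,
  stringsAsFactors = FALSE
)

# Prepend the reserved tokens and sort by index.
special <- data.frame(
  word = c("<PAD>", "<START>", "<UNK>", "<UNUSED>"),
  idx = 0:3,
  stringsAsFactors = FALSE
)
toy_df <- rbind(special, toy_df)
toy_df <- toy_df[order(toy_df$idx), ]
print(toy_df)
```

In this toy table the most frequent word ends up at index 4, right after the four reserved tokens.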
A small helper function makes decoding convenient.
decode_review<-function(text) {
  paste(map(text,function(number) {
    word_index_df%>%
      filter(idx==number)%>%
      select(word)%>%
      pull()
  }),collapse=' ')
}
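decode_review filters the whole table once per token, which can be slow for long reviews. A possible vectorized alternative is sketched below using base R's match(). The name `decode_review_fast` and the toy `lookup` table are hypothetical, and mapping unknown indices to "<UNK>" is an assumption of this sketch rather than the behaviour of the original function.

```r
# Toy lookup table in the same shape as word_index_df (hypothetical data).
lookup <- data.frame(
  word = c("<PAD>", "<START>", "<UNK>", "<UNUSED>", "the", "and", "movie"),
  idx = 0:6,
  stringsAsFactors = FALSE
)

# Hypothetical vectorized variant: match() finds all positions in one pass.
decode_review_fast <- function(text, table = lookup) {
  words <- table$word[match(unlist(text), table$idx)]
  words[is.na(words)] <- "<UNK>"  # indices missing from the table
  paste(words, collapse = " ")
}

decoded <- decode_review_fast(c(1, 4, 6, 5, 2))
print(decoded)
```

Passing `table = word_index_df` would apply the same idea to the real lookup table built earlier.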
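Padding
To feed text into a neural network, every review must have the same length. The padding value here is 0, so pad_sequences(test_data,maxlen=256,padding="post") alone would be enough, because value defaults to 0. To cover the more general case, however, the code below looks up the padding index explicitly. For more on the pull() function, see another post on this blog.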
train_data_ps<-pad_sequences(
  train_data,
  value=word_index_df%>%
    filter(word=="<PAD>")%>%
    select(idx)%>%
    pull(),
  padding="post",
  maxlen=256
)
test_data_ps<-pad_sequences(
  test_data,
  value=word_index_df%>%
    filter(word=="<PAD>")%>%
    select(idx)%>%
    pull(),
  padding="post",
  maxlen=256
)
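What the padding step does can be sketched in base R. The helper `pad_post` below is hypothetical and only imitates pad_sequences with padding="post" and the default truncating="pre", under which a review longer than maxlen keeps its last maxlen tokens.

```r
# Hypothetical helper mimicking pad_sequences(padding="post") with the
# default truncating="pre": long sequences keep their LAST maxlen tokens.
pad_post <- function(seqs, maxlen, value = 0) {
  t(vapply(seqs, function(s) {
    if (length(s) > maxlen) s <- tail(s, maxlen)  # truncate from the front
    c(s, rep(value, maxlen - length(s)))          # pad at the end
  }, numeric(maxlen)))
}

toy <- list(c(1, 14, 22), c(1, 5, 6, 7, 8, 9))
padded <- pad_post(toy, maxlen = 4)
print(padded)
```

The short review is filled with zeros on the right, while the long one loses its first two tokens, so every row has exactly maxlen columns.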
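Modeling
Build the model with Keras. The embedding layer (layer_embedding) maps each of the 10,000 word indices to a trainable 16-dimensional vector. As with Word2Vec-style embeddings, this dense representation is far more compact than one-hot encoding and makes the computation more efficient. Next, layer_global_average_pooling_1d averages those vectors over the 256 word positions, so each review is reduced to a single 16-dimensional vector. That vector passes through a dense hidden layer of 16 ReLU units, and finally an output layer with a sigmoid activation produces the probability that the review is positive.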
model<-keras_model_sequential()
model%>%
  layer_embedding(
    input_dim=10000,
    output_dim=16)%>%
  layer_global_average_pooling_1d()%>%
  layer_dense(units=16,activation='relu')%>%
  layer_dense(units=1,activation='sigmoid')
model%>%compile(
  optimizer='adam',
  loss='binary_crossentropy',
  metrics=list('accuracy')
)
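To make the shapes concrete, here is a rough base R sketch of one forward pass through the same kind of architecture. The sizes and random weights are toy values chosen for illustration, not the trained model. The sketch averages padding positions together with real words, which is assumed to match the model above since no masking is configured there. The last lines count the parameters of the actual 10000/16/16/1 architecture.

```r
set.seed(1)
vocab <- 10; emb_dim <- 4; seq_len <- 5  # toy sizes, not 10000/16/256

# Embedding: one trainable row per word index (row i+1 holds index i).
E <- matrix(rnorm(vocab * emb_dim), nrow = vocab)
review <- c(1, 4, 7, 0, 0)                 # one padded toy review
embedded <- E[review + 1, , drop = FALSE]  # seq_len x emb_dim

# Global average pooling: mean over word positions -> emb_dim values.
pooled <- colMeans(embedded)

# Dense ReLU layer followed by a single sigmoid unit (random toy weights).
W1 <- matrix(rnorm(emb_dim * emb_dim), nrow = emb_dim)
b1 <- rep(0, emb_dim)
w2 <- rnorm(emb_dim)
b2 <- 0
hidden <- pmax(0, as.vector(pooled %*% W1) + b1)
prob <- 1 / (1 + exp(-(sum(hidden * w2) + b2)))

# Parameter count of the architecture used in this post:
# embedding + dense(16) weights and biases + output weights and bias.
n_params <- 10000 * 16 + (16 * 16 + 16) + (16 + 1)
print(dim(embedded))
print(length(pooled))
print(prob)
print(n_params)
```

Almost all of the parameters sit in the embedding table, which is why restricting the vocabulary with num_words keeps the model small.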
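Training
Now train the model. With validation_split=0.2, 20% of the training data is held out for validation, and the callback prints one dot per epoch.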
history<-model%>%
  fit(
    train_data_ps,train_labels,
    epochs=40,
    batch_size=128,
    verbose=FALSE,
    validation_split=0.2,
    callbacks=list(
      # Print one dot per epoch, starting a new line every 20 epochs
      callback_lambda(
        on_epoch_end=function(e,l){
          if(e%%20==0) cat('\n')
          cat('.')
        })
    ))
....................
....................
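The training settings translate into a concrete amount of work. The arithmetic below is a small sketch that assumes the IMDB training set holds 25,000 reviews, a figure that is not stated in this post.

```r
n_total <- 25000          # assumed size of the IMDB training set
validation_split <- 0.2
batch_size <- 128
epochs <- 40

n_val <- round(n_total * validation_split)        # held-out reviews
n_train <- n_total - n_val                        # reviews used for updates
steps_per_epoch <- ceiling(n_train / batch_size)  # last batch may be smaller
total_updates <- steps_per_epoch * epochs

cat(n_train, n_val, steps_per_epoch, total_updates, "\n")
```

Under that assumption, each dot in the progress output stands for one pass over the training portion, with the held-out portion used only to monitor validation metrics.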
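Evaluation
Evaluated on the test data, the accuracy is about 84.6%. The test loss of 0.81 is higher than the roughly 0.69 a model would score by always predicting 0.5, which indicates that some misclassified reviews receive very confident predictions. A more sophisticated approach, such as an LSTM, is needed to improve on this.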
model%>%evaluate(test_data_ps,test_labels,verbose=0)
$loss
[1] 0.81172
$acc
[1] 0.84552
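The two reported numbers measure different things, which a toy calculation in base R can illustrate. The probabilities and labels below are invented for the example; only the formulas correspond to the binary_crossentropy loss and accuracy metric named in the compile step, and the 0.5 cut-off is the conventional choice assumed here.

```r
# Hypothetical predicted probabilities and true labels for five reviews.
p <- c(0.95, 0.10, 0.80, 0.02, 0.60)
y <- c(1, 0, 1, 1, 0)

# Binary cross-entropy: mean of -[y*log(p) + (1-y)*log(1-p)].
bce <- function(y, p, eps = 1e-7) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Accuracy: share of reviews whose thresholded prediction matches the label.
accuracy <- mean(as.integer(p > 0.5) == y)

loss <- bce(y, p)
# Loss contributed by the confidently wrong fourth review alone.
worst <- -log(0.02)
cat(accuracy, loss, worst, "\n")
```

A single confidently wrong prediction dominates the average loss, while accuracy counts it as just one error among five.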
Prediction
Now predict the sentiment of the first 15 test reviews.
# Select the first 15 rows (reviews); the trailing comma keeps all columns
model%>%predict(test_data_ps[1:15,])%>%round(3)
[,1]
[1,] 0.000
[2,] 0.000
[3,] 0.000
[4,] 0.000
[5,] 0.000
[6,] 0.000
[7,] 0.000
[8,] 0.000
[9,] 0.000
[10,] 1.000
[11,] 0.000
[12,] 0.000
[13,] 0.000
[14,] 0.997
[15,] 0.000
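The sigmoid outputs are probabilities, so a cut-off is needed to turn them into class labels. The sketch below copies the 15 values printed above and applies a threshold of 0.5, which is a conventional choice assumed here rather than something the post sets explicitly.

```r
# Probabilities copied from the output above (rows 1-15).
probs <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0.997, 0)

# Turn probabilities into class labels with a 0.5 cut-off (an assumed,
# conventional threshold).
pred_class <- as.integer(probs > 0.5)

# Row numbers of the reviews classified as positive.
positive_rows <- which(pred_class == 1)
print(positive_rows)
```

Comparing `pred_class` with the matching entries of test_labels would show which of these 15 reviews the model gets right.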
Review 9 is classified as negative (value = 0). Reading it, the reviewer writes "mainly of trick photography the only outstanding positive feature which survives is its beautiful color and clarity sad to say of the many films made in this genre few of them come up to alexander", which shows a clearly negative attitude.
decode_review(test_data[[9]])
[1] "<START> hollywood had a long love affair with bogus <UNK> nights tales but few of these products have stood the test of time the most memorable were the jon hall maria <UNK> films which have long since become camp this one is filled with dubbed songs <UNK> <UNK> and slapstick it's a truly crop of corn and pretty near <UNK> today it was nominated for its imaginative special effects which are almost <UNK> in this day and age <UNK> mainly of trick photography the only outstanding positive feature which survives is its beautiful color and clarity sad to say of the many films made in this genre few of them come up to alexander <UNK> original thief of <UNK> almost any other <UNK> nights film is superior to this one though it's a loser"
The label originally given by the reviewer is also 0.
test_labels[[9]]
[1] 0
Now look at a review that was classified as positive (value = 1). It is quite positive and includes phrases such as "both as himself and as the batman the four principals turn in excellent performances especially walken".
decode_review(test_data[[10]])
[1] "<START> this film is where the batman franchise ought to have stopped though i will <UNK> that the ideas behind batman forever were excellent and could have been easily realised by a competent director as it turned out this was not to be the case br br apparently warner brothers executives were disappointed with how dark this second batman film from tim burton turned out apart from the idiocy of expecting anything else from burton and the conservative <UNK> of their subsequent decision to turn the franchise into an homage to the sixties tv series i fail to understand how batman returns can be considered at all disappointing br br true it is not quite the equal of the first film though it <UNK> all the minor <UNK> of style found in batman a weaker script that <UNK> the <UNK> between not just two but three characters invites <UNK> comparisons to the masterful pairing of keaton and jack nicholson as the joker in the first film yet for all this it remains a <UNK> dark film true to the way the batman was always meant to be and highly satisfying br br michael keaton returns as the batman and his alter ego bruce wayne with <UNK> max <UNK> christopher walken named in honour of the 1920s german silent actor his partner in crime <UNK> <UNK> the penguin danny <UNK> in brilliant makeup reminiscent of laurence <UNK> richard iii and <UNK> kyle the <UNK> michelle pfeiffer whom wayne romances both as himself and as the batman the four principals turn in excellent performances especially walken and <UNK> while together keaton and pfeiffer explore the darker side of double identities br br there are some intriguing concepts in this film about the only weakness i can really point out is a certain to the script in some places which i think is due mostly to the way this film is a four <UNK> fight there simply isn't enough time to properly explore what's going on br br nevertheless this is a damn good film i highly recommend watching this in <UNK> with the first and then <UNK> for how good the series could have been had it continued under burton and keaton"
The actual label is also 1.
test_labels[[10]]
[1] 1