Predicting visual context for unsupervised event segmentation in continuous photo-streams
Segmenting video content into events provides semantic structure for indexing, retrieval, and summarization. Since motion cues are not available in continuous photo-streams, and annotations in lifelogging are scarce and costly, frames are usually clustered into events by comparing their visual features in an unsupervised way. However, such methods are ineffective at dealing with heterogeneous events, e.g. taking a walk, and with temporary changes in sight direction, e.g. at a meeting. To address these limitations, we propose Contextual Event Segmentation (CES), a novel segmentation paradigm that uses an LSTM-based generative network to model photo-stream sequences, predict their visual context, and track their evolution. CES decides whether a frame is an event boundary by comparing the visual context generated from the frames in the past with the visual context predicted from the future. We implemented CES on a new and massive lifelogging dataset consisting of more than 1.5 million images spanning over 1,723 days. Experiments on the popular EDUB-Seg dataset show that our model outperforms the state of the art by over 16% in F-measure. Furthermore, CES' performance is only 3 points below that of human annotators.
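The abstract sketches the core mechanism of CES: an LSTM-based model predicts the visual context around each frame, and a frame is flagged as an event boundary when the context inferred from past frames disagrees with the context predicted from future frames. The snippet below is a minimal illustrative sketch of that boundary-scoring idea, not the authors' implementation: the module and function names, the 1024-d features, the fixed-size windows, and the use of cosine distance as the comparison metric are all assumptions made for illustration.

```python
# Illustrative sketch only: ContextPredictor, boundary_scores, the feature
# dimension, the 5-frame windows, and cosine distance are assumptions, not
# the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextPredictor(nn.Module):
    """LSTM that reads a short sequence of frame features and predicts the
    visual context (a feature vector) at the end of that sequence."""

    def __init__(self, feat_dim=1024, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, feat_dim)

    def forward(self, feats):            # feats: (1, T, feat_dim)
        out, _ = self.lstm(feats)
        return self.project(out[:, -1])  # predicted context: (1, feat_dim)


def boundary_scores(features, predictor, window=5):
    """Score each frame t by the disagreement between the context predicted
    from the frames before t and the context predicted from the frames after
    t (read backwards). Peaks in the score curve are boundary candidates."""
    n_frames = features.size(0)
    scores = torch.zeros(n_frames)
    predictor.eval()
    with torch.no_grad():
        for t in range(window, n_frames - window):
            past = features[t - window:t].unsqueeze(0)            # frames before t
            future = features[t:t + window].flip(0).unsqueeze(0)  # frames after t, reversed
            ctx_past = predictor(past)
            ctx_future = predictor(future)
            scores[t] = 1.0 - F.cosine_similarity(ctx_past, ctx_future).item()
    return scores


# Toy usage: random vectors stand in for per-frame CNN descriptors.
frame_feats = torch.randn(200, 1024)
scores = boundary_scores(frame_feats, ContextPredictor(), window=5)
print(scores[:10])
```

In practice the predictor would be trained on many lifelog sequences to forecast the next frame's features, and the resulting score curve would be smoothed and thresholded to pick the final event boundaries.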
Main Authors: DEL MOLINO, Ana García; LIM, Joo-Hwee; TAN, Ah-hwee
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2018
Subjects: Lifelogging; Event Segmentation; Visual Context Prediction; Databases and Information Systems; Graphics and Human Computer Interfaces
Online Access: https://ink.library.smu.edu.sg/sis_research/5466 ; https://ink.library.smu.edu.sg/context/sis_research/article/6469/viewcontent/1808.02289.pdf
Institution: Singapore Management University
Record ID: sg-smu-ink.sis_research-6469
Record Format: dspace
Date: 2018-10-01
DOI: 10.1145/3240508.3240624
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Collection: Research Collection School Of Computing and Information Systems
Content Provider: SMU Libraries (InK@SMU), Singapore Management University, Singapore