About 37,800 characters
About 50 pages
Published 2026-01-21 in Beijing

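Stacked Attention for Visual Question Answering

Bingbin Liu
Department of Computer Science
Stanford University
bingbin@stanford.edu

Weini Yu
Department of Computer Science
Stanford University
weiniyu@stanford.edu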

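Our project explores visual question answering on multiple-choice questions about a given image. Building on recent work that uses different mechanisms, we first build a Bag-of-Words model with a multilayer perceptron [1, 2], using GloVe word embeddings [3] and ResNet image features [4]; this model outperforms the original LSTM-with-attention model [5]. We then extend it by replacing the language model with an LSTM and adding stacked spatial attention layers [6] to capture the interaction between words and image regions. We study different aspects of the VQA task on the Visual7W dataset [5] by experimenting with many different settings, and obtain interesting results. Finally, we analyze which options contribute to better results.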
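The stacked spatial attention idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not give the equations, so the additive tanh scoring, the residual query update, the random weights, and every name and dimension below (`attention_layer`, `stacked_attention`, `random_layer`, `d`, `k`) are assumptions chosen for illustration. Each hop scores the image regions against the current query, takes a softmax-weighted sum of the regions, and adds it to the query before the next hop.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(regions, query, W_i, W_q, w_p):
    """One attention hop: weight image regions by relevance to the query.

    regions: list of d-dimensional region feature vectors
    query:   d-dimensional query vector
    Returns (attention weights, refined query).
    """
    q_proj = matvec(W_q, query)
    scores = []
    for v in regions:
        v_proj = matvec(W_i, v)
        # Assumed additive scoring: tanh of projected region plus projected query.
        hidden = [math.tanh(a + b) for a, b in zip(v_proj, q_proj)]
        scores.append(sum(w * h for w, h in zip(w_p, hidden)))
    weights = softmax(scores)
    # Attention-weighted sum of the region features.
    attended = [sum(p * v[j] for p, v in zip(weights, regions))
                for j in range(len(query))]
    # Refined query = attended image vector + previous query.
    refined = [a + q for a, q in zip(attended, query)]
    return weights, refined


def stacked_attention(regions, query, layers):
    """Apply attention hops in sequence, feeding each refined query forward."""
    all_weights = []
    for W_i, W_q, w_p in layers:
        weights, query = attention_layer(regions, query, W_i, W_q, w_p)
        all_weights.append(weights)
    return all_weights, query


def random_layer(d, k, rng):
    """Hypothetical random parameters standing in for learned weights."""
    W_i = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(k)]
    W_q = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(k)]
    w_p = [rng.uniform(-0.5, 0.5) for _ in range(k)]
    return W_i, W_q, w_p


# Toy inputs: 5 image regions and one question vector, all of dimension 4.
rng = random.Random(0)
d, k, num_regions = 4, 3, 5
regions = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(num_regions)]
query = [rng.uniform(-1, 1) for _ in range(d)]
layers = [random_layer(d, k, rng) for _ in range(2)]
all_weights, final_query = stacked_attention(regions, query, layers)
```

In a trained model the region vectors would come from a convolutional feature map and the query from the LSTM question encoding; here both are random placeholders, so the attention weights carry no meaning beyond showing the data flow across two hops.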

1 Introduction

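By combining visual and language understanding, two of the most important input modalities in artificial intelligence, visual question answering (VQA) has attracted wide interest in the research community since it was formally introduced [7]. In the VQA setting, a model should be able to answer natural-language queries about an image. Unlike object recognition, free-form queries require the model to understand natural language beyond classifying into discrete categories. Unlike text-based question answering, VQA further requires the model to find the semantic connection between the textual description and the image, which makes the learning task more complex.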

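In this project, we are interested in understanding what matters for a VQA model. Besides trying to train a well-performing model on the dataset, we also want to understand where the model gets the information it uses for prediction, and how this information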

Stacked Attention for Visual Question Answering

Bingbin Liu
Department of Computer Science
Stanford University
bingbin@stanford.edu

Weini Yu
Department of Computer Science
Stanford University
weiniyu@stanford.edu

Our project explores Visual Question Answering on multiple-choice questions given an image. Based on recent works using different mechanisms, we first build a Bag-of-Words model with MLP [1, 2] using GloVe word embeddings [3] and ResNet image features [4], which outperforms the original LSTM with attention model [5]. We further extend it by replacing our language model with an LSTM and adding stacked spatial attention layers following [6] to capture the interaction between the words and image regions. We investigate different aspects of the VQA task on the Visual7W dataset [5] by experimenting with many different settings and obtain interesting results. Finally, we present an analysis of which options contribute to better results.

1 Introduction

By combining visual and language understanding, two of the most important input modalities in artificial intelligence, visual question answering (VQA) has sparked wide interest in the research community since the term w
