Nikolai Ilinykh
Photo: Monica Havström

Nikolai Ilinykh: Computational Models of Language and Vision - Studies of Neural Models as Learners of Multi-Modal Knowledge

Culture and languages

Dissertation for Ph.D. in Computational Linguistics at the Faculty of Humanities, Department of Philosophy, Linguistics and Theory of Science. Welcome!

Dissertation
Date
11 Jun 2024
Time
13:00 - 18:00
Location
Room 222, Renströmsgatan 6

Respondent:
Nikolai Ilinykh, Department of Philosophy, Linguistics and Theory of Science

Thesis title:
Computational Models of Language and Vision: Studies of Neural Models as Learners of Multi-Modal Knowledge

Examining committee:
Professor Lilja Øvrelid, University of Oslo
Professor Jörg Tiedemann, University of Helsinki
Assistant Professor Desmond Elliott, University of Copenhagen

Substitute, should a committee member be absent:
Associate Professor Bahareh Afshari, University of Gothenburg

Opponent:
Assistant Professor Carina Silberer, University of Stuttgart

Chair:
Professor Eleni Gregoromichelaki, University of Gothenburg

Vision and language workshop: 
10 June 13:00 – 17:00
The Language and Perception research group at CLASP is organising a workshop on language and vision tasks, data, and models, co-located with the doctoral defence of Nikolai Ilinykh. More information is available on the external CLASP site:
https://gu-clasp.github.io/language-and-perception/events/language-and-vision-workshop/

Abstract:
This thesis develops and evaluates computational models that generate natural language descriptions of visual content. We build and examine models of language and vision to gain a deeper understanding of how they reflect the relationship between the two modalities. This understanding is crucial for performing computational tasks. The first part of the thesis introduces three studies that inspect the role of self-attention in three different self-attention blocks of the object relation transformer model. We examine attention heatmaps to understand how the model connects different words, objects, and relations within the tasks of image captioning and image paragraph generation. We connect our interpretation of what the model learns in self-attention weights with insights from theories about human cognition, visual perception, and spatial language. The three studies in the second part of the thesis investigate how representations of images and texts can be applied and learned in task-specific models for image paragraph generation, embodied question answering, and variation in human object naming. The final two studies, in the third part, examine properties of human-generated texts that multi-modal models are expected to acquire in image paragraph generation as well as in perceptual category description and interpretation tasks. We analyse discourse structure in image paragraphs produced with different decoding methods. We also inspect whether models of perceptual categories can abstract from visual representations and use this knowledge to generate descriptions that exhibit the levels of discriminativity important for the task. We show how automatic measures for evaluating text generation behave in a comparison of model-generated and human-generated image descriptions.
This thesis presents several contributions. We illustrate that, under specific modelling conditions, self-attention can capture information about the relationship between objects and words. Our results emphasise that the specifics of the task determine the manner and context in which different modalities are processed, as well as the degree to which each modality contributes to the task. We demonstrate that, while favoured by automatic evaluation metrics in different tasks, machine-generated image descriptions lack the discourse complexity and discriminative power that are often important for generating better, human-like image descriptions.