Researchers at the Faculty of Humanities. Photo: Johan Wingborg

Language models with a human touch

The impressive new GPT models produce texts that appear to be written by a person. But can the models really create human language?
This is something researchers at the Faculty of Humanities are investigating.

Generative Pre-trained Transformers, GPTs, are advanced language models available in several versions. The most recent, GPT-3 and GPT-4, were created by the company OpenAI and have been trained on tens of billions of text samples. Similar models are also being developed in Sweden, such as GPT-SW3, created by AI Sweden, and a Swedish BERT model produced by the National Library of Sweden.

Although the texts that the language models produce seem surprisingly natural, the models do not function at all like human language, explains Simon Dobnik, Professor of Computational Linguistics.

– Unlike humans, the models don’t have access to the surrounding world; instead, they work by finding statistical relationships in huge collections of text. The quality of an answer depends on how probable the word sequences are in the texts the models have learned from, not on how likely it is that something is true in reality.
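
As a rough illustration of that principle, here is a tiny Python sketch with an invented mini-corpus. Real GPT models use neural networks rather than word counts, so this only shows the underlying idea: the most likely continuation is simply the one that occurred most often in the training text, whether or not it is true.

from collections import Counter, defaultdict

# A deliberately tiny illustration: the "best" next word is the one that most
# often followed the previous word in the training text, regardless of whether
# the resulting statement is true. (The corpus is invented for the example;
# real GPT models use neural networks, not raw word counts.)

training_text = (
    "the banana is yellow . the banana is tasty . "
    "the sky is blue . the sky is blue ."
)

counts = defaultdict(Counter)
tokens = training_text.split()
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1          # how often does nxt follow prev?

def most_likely_next(word):
    """Return the word that most frequently followed `word` in the training text."""
    return counts[word].most_common(1)[0][0]

print(most_likely_next("is"))       # -> "blue": chosen for its frequency, not its truth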

Know nothing about reality outside

The language models approximate meaning according to the distributional hypothesis, explains Nina Tahmasebi, associate professor in language technology.

– The hypothesis says that a word like “chair” often appears in the same contexts as words with a similar meaning, like “table”. The meaning is determined indirectly: if “table” and “chair” appear together in texts describing the world, then there is a real relationship between the words. However, the GPT models cannot figure out that a banana is yellow, because they know nothing about the reality outside their text world.
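
A toy example of the distributional hypothesis can be written in a few lines of Python. The mini-corpus and the whole-sentence context window are invented for the illustration; the point is only that words sharing contexts get similar co-occurrence vectors, while nothing in those vectors reveals what a banana actually looks like.

import math
from collections import Counter
from itertools import combinations

corpus = [
    "the chair stood by the table",
    "she pushed the chair under the table",
    "a yellow banana lay on the table",
    "he ate a banana",
]

cooc = Counter()
vocab = set()
for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    for w1, w2 in combinations(words, 2):   # count co-occurrences within a sentence
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

def vector(word):
    """The word's co-occurrence counts with every word in the vocabulary."""
    return [cooc[(word, other)] for other in sorted(vocab)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(vector("chair"), vector("banana")))   # lower similarity
print(cosine(vector("chair"), vector("table")))    # higher: shared contexts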

Unlike humans, the models don’t have access to the surrounding world, Simon Dobnik points out. Photo: Johan Wingborg

Risks with the language models

Asad Sayeed. Photo: Johan Wingborg

Recently, 1,880 researchers signed a letter in which they point out the risks of letting the language models take over various human activities.

But there are many other problems, Asad Sayeed, senior lecturer in computational linguistics, points out.

– I recently asked the GPT-3 model to write something about the female presidents of France. In response, I got a text with a number of names that looked credible. When I instead asked about female presidents of North Korea, I got the correct answer: there have never been any. That the answer was correct this time is because there is not much information about North Korea on the internet, and thus, paradoxically, nothing to build an incorrect conclusion on either.

High costs

The massive language models that multinational companies such as Microsoft, Google and OpenAI develop cost an enormous amount, both in money and in energy, explains Felix Morger, PhD student in language technology.

– The language models run on cloud services, on servers in data centers all over the world, where the information is constantly being processed. The costs are so high that only the largest international companies can make such investments – not even a country as large as the United States can keep pace when the global giants compete to be first and best at launching new products. There are calculations showing that the training of GPT-3 consumed as much energy as 120 American homes use in a year.

Not a good representation of humanity

However, exactly how much the systems cost and what environmental impact they have is kept secret, explains Simon Dobnik.

– There is also no precise information about what material the models are based on or how much human monitoring there is, for example of material with unwanted social bias. When this kind of information is not shared with the research community, we cannot find out whether there are ways to build models that are just as good using smaller volumes of data, less human intervention and a smaller environmental footprint.

The models also do not give a very good representation of humanity.

– An overwhelming majority of the texts the models are trained on come from the English-speaking Western middle class, while subcultures and other cultures and languages are poorly represented. The inequality is further exacerbated by the fact that it is people in the rich world who can afford to use the new technology, not citizens of developing countries.

To handle issues such as how the models work, what biases they contain and what they can be used for, both language technology and humanistic knowledge are necessary, says Nina Tahmasebi.

– Language is not about statistical relationships between different words, but about how we humans communicate and relate to the outside world. Therefore, knowledge about humans and our world cannot be omitted from a language model.

Smaller models

The research conducted at GU is about building smaller language models and investigating what they can learn about language and human communication. With the help of knowledge from linguistics, psychology and various social sciences, the models can be improved and various biases can be discovered and counteracted. And because the models that the linguists use are based on deep analyses of how language works, the huge volumes of material that the global companies use are not needed, Nina Tahmasebi points out.

– They can thus be used by both researchers and companies. We build models that interpret what is reasonable by treating text as language and not just as data. In this way, we can make models that are cheaper, more environmentally friendly and also more reliable.

Several ongoing projects

One of the ongoing projects at the Faculty of Humanities is SuperLim 2.0, which is mostly finished, says Felix Morger.

– It is about producing a collection of Swedish material for testing models’ language understanding. The test sets are already available, and the website for uploading and comparing results will be online soon.

Another project, Granny Karl is 27 years old, is about pseudonymisation of research data, says Simon Dobnik.

– The idea is to create language technology algorithms that detect personal data and sensitive information in large masses of text and replace the words with suitable pseudonyms, without changing the meaning of the text or introducing more bias.
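
The project’s own algorithms are not described in the article, but the basic idea can be sketched in Python. The sketch below uses spaCy’s off-the-shelf English named-entity recognizer and an invented pseudonym list purely as stand-ins for the project’s methods.

import spacy

# Rough sketch: find person names with a standard NER model and replace them,
# consistently, with pseudonyms, leaving the rest of the text unchanged.
# Assumes the model is installed: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
PSEUDONYMS = ["Karl", "Maria", "Jonas"]        # hypothetical replacement names

def pseudonymise(text):
    """Replace detected person names with pseudonyms, one pseudonym per original name."""
    doc = nlp(text)
    mapping = {}                                # original name -> pseudonym
    pieces, last = [], 0
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        if ent.text not in mapping:
            mapping[ent.text] = PSEUDONYMS[len(mapping) % len(PSEUDONYMS)]
        pieces.append(text[last:ent.start_char])
        pieces.append(mapping[ent.text])
        last = ent.end_char
    pieces.append(text[last:])
    return "".join(pieces)

print(pseudonymise("Interview with Anna Svensson, who met Peter in Gothenburg."))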

Another project is about reinforcement learning, where machine learning takes place through interaction with the environment, says Asad Sayeed.

– For example, we are looking at how artificial agents learn to name different colours in a guessing game with one speaker and one listener, where both agents are rewarded when they reach agreement.
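
A minimal version of such a guessing game can be sketched in Python. The three colours, the invented vocabulary and the simple tabular epsilon-greedy learners are assumptions made for the illustration, not a description of the project’s actual setup.

import random
from collections import defaultdict

COLOURS = ["red", "green", "blue"]     # what the speaker sees
WORDS = ["blip", "blop", "blup"]       # an invented vocabulary to name them with

speaker_q = defaultdict(float)         # (colour, word)  -> estimated value
listener_q = defaultdict(float)        # (word, colour)  -> estimated value
EPSILON, LEARNING_RATE = 0.1, 0.1

def choose(q, state, actions):
    """Epsilon-greedy choice: mostly pick the best-valued action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

for _ in range(20000):
    colour = random.choice(COLOURS)               # the speaker observes a colour
    word = choose(speaker_q, colour, WORDS)       # ... and names it
    guess = choose(listener_q, word, COLOURS)     # the listener guesses the colour
    reward = 1.0 if guess == colour else 0.0      # both are rewarded on agreement
    speaker_q[(colour, word)] += LEARNING_RATE * (reward - speaker_q[(colour, word)])
    listener_q[(word, guess)] += LEARNING_RATE * (reward - listener_q[(word, guess)])

# After enough episodes the agents typically agree on a distinct word per colour.
for colour in COLOURS:
    print(colour, "->", max(WORDS, key=lambda w: speaker_q[(colour, w)]))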

The humanistic perspective is fundamental

The humanistic perspective is fundamental to creating language models that are truly reliable and useful, Nina Tahmasebi states.

– We can use existing texts in smarter and more equitable ways, prevent the spread of false information and various types of prejudice, and start from what people need rather than from what is technically possible. We are not looking to put an end to commercial products, but we believe it is important to examine both what they can actually do and what they are suited for. Beyond being used as a fun gimmick, these methods can provide answers to deep, complex research questions that contribute to society.

Text: Eva Lundgren

The article was first published in GU Journalen, no 2, 2023

Facts

Nina Tahmasebi is Associate Professor of Language Technology and Programme Manager for Change is Key! The study of contemporary and historical societies using methods for synchronic semantic change, which was granted SEK 33.5 million by Riksbankens Jubileumsfond.

Simon Dobnik is Professor of Computational Linguistics at the Centre for Linguistic Theory and Studies in Probability (CLASP) and a participant in the project Mormor Karl är 27 år: Automatisk pseudonymisering av forskningsdata (Granny Karl is 27 years old: Automatic pseudonymisation of research data).

Felix Morger is a doctoral student at Språkbanken Text and involved in the project SuperLim 2.0.

Asad Sayeed is Senior Lecturer in Computational Linguistics at CLASP and researches language and mental images.