The world was amazed and horrified when the first AI language models began to emerge, mimicking human speech and writing. Many of these models have ended up in the hands of companies like OpenAI and Google.
At Google I/O 2021, Google presented LaMDA, an AI system that can interact naturally with users on almost any topic. Google itself, however, has been involved in controversies in this area, such as the presence of sexism, homophobia and racism in major AI models.
For this reason, more than 500 scientists from around the world are working under the umbrella of the BigScience project — led by a startup known as Hugging Face — which aims to create fully open artificial intelligence language models for the community, not dependent on companies like OpenAI and Google.
To understand what we are dealing with, we need some context. Google intends to integrate LaMDA into virtually all of its services — its search engine, its assistant and even its collaborative work platform, Google Workspace. The idea is that users will have an interface from which they can get information across Google's services simply by asking LaMDA.
This raises several issues. The first is the most obvious: these language-processing technologies will end up in our day-to-day lives, directly affecting us — a large language model (LLM) woven into our daily services that, in turn, depends on a large company such as Google.
Another issue, one that has been discussed before, is the inclusion of extremist ideas in language models. Because AI is trained primarily on huge amounts of data, it has absorbed some of humanity's most toxic ideas: systems with racist or sexist patterns have already been discovered.
The sheer size of these models creates a collateral problem: the mass production of misinformation riddled with these biases. This very point was raised by Timnit Gebru, co-lead of Google's ethical AI team, in a paper. She was fired by Google after refusing to retract it.
And all this is only looking at Google. There are also OpenAI's language models, such as GPT-2 and GPT-3, capable of generating convincingly human text. Microsoft is granting exclusive GPT-3 licenses for unannounced products, and Facebook is developing its own LLMs for content moderation and translation.
Few studies have analyzed in depth how these problems with language models are interconnected, even as the models reach into almost every aspect of our daily lives. Google has already made its intentions clear. The few companies with the resources to develop these huge language models have financial interests at stake, and they don't scrutinize their models as much as they should.
This is where BigScience comes in. The project seeks to shift the focus to one of "open science" — understanding natural language as applied to AI and building an open-source LLM that does not depend on any company or product. It was born specifically as a response to the lack of scrutiny of LLMs by large companies.
BigScience aims to develop a fully open-source LLM with a range of applications, such as enabling critical research independent of companies. To that end, the project has already secured a grant to train the model on a supercomputer owned by the French government.
The BigScience LLM will be distinguished by its multilingual character. Eight languages and language families have been selected, including English, Chinese, Arabic, Hindi and the Bantu family. The plan is to work closely with each language community to capture as many of its regional dialects as possible and to ensure that its data privacy standards are respected.
And no, this LLM is not meant to compete with LaMDA or GPT-3. In fact, it will most likely serve scientific research rather than business. What's more, none of the researchers are being paid for this project; they are volunteers. The French grant covers only computational resources.
The BigScience members hope that by the end of the project — expected by the end of May 2022 — there will be not only in-depth research on the limitations and implications of LLMs, but also better tools for developing and deploying them responsibly.