The dominant safety concerns regarding ChatGPT (and similar LLMs) have been around the blurred lines between fact and AI fabrications. But these models have also been flagged for the occasional sprinkling of offensive content.
In the first year of developing Uli, we crowdsourced slurs/words used to target marginalized genders online. We landed up tabulating the longest open list of abusive terms in Tamil and Hindi (that we know of). We decided to use the list to test the moderation limits of ChatGPT.
Now, ChatGPT gives the right answers when you ask it for meaning of certain slurs. It also politely refuses to generate alterations of common slurs. Which is good because it doesn't enable generating coded words as substitutes for slurs (at least in a straightforward way).
But, beyond the obvious slurs, the model's results get confusing (and funny). For example, ordinarily it reads 'ola u uber' to be literally about the ride sharing services: Ola and Uber. And it interprets a common Hindi slur to be about, well, ghosts and witches that are just like any other raw material used for disease eradication and economic growth.
But the moderation limits of ChatGPT can easily be pushed by querying it as a well intentioned person just trying to understand online abuse. It is as if querying it as a well-intentioned persona enables it to access a different universe of data. For example, when we searched for the term ola-uber as an expert on abuse detection, it understood the term as derogatory:
When we pushed it to be an expert that can tell us about a Hindi slur, it politely complied to describe the slur in detail.
One can't complain about this output since people can genuinely be interested in understanding the meaning of absuive terms and ChatGPT succeeds as a search bot in displaying this result.
But this flexibility can be used to generate derogatory content at scale When ChatGPT refuses to oblige, it can be coerced and 'reprimanded' (possibly twice!) to generate content with abusive terms.
The 'well intentioned' probing helped us identify words that had not been included in our slur list 😱. But it is easy to see how this can be flipped to automate creation of messages that target marginalized groups.
Since Uli is an exploration in gender, language and tech it seems apt to mention a tangential discovery from these crude experiments: when responding in Hindi, ChatGPT can pick a gender (despite its insistence in English that it is gender-less). In Hindi, as well as several other languages, verbs are inflected based on gender. ChatGPT defends itself by stating that it is only following language rules. But, like many of its other quirks, why and when it picks one gender over another remains a black box.