"Failure to Fail": AI-Powered Language Tools & Humanitarian Applications
It’s fair to say that the past year has been a boon for “AI”-powered tools,1 which have skyrocketed in both popularity and adoption. In particular, the broadening availability of multipurpose Large Language Models (LLMs) has spurred the development of tools that provide text editing, summarizing, translation, and generation—including in languages other than English.
That said, machine-assisted translation tools like Google Translate have been around for a long time. So, what’s the big deal with LLM-powered language tools? This article explores the inner workings of these tools, demonstrating the expected and unexpected ways in which they fail and what this means for their adoption and use, especially by public service providers and humanitarian actors.
Modern Machine-Assisted Translation & LLMs
Modern machine-assisted translation tools and LLMs both operate by leveraging foundational technologies derived from the field of Machine Learning (ML); however, they differ significantly in terms of their application and resulting outputs.
Both technologies rely on being trained on “ground truth” data, using features derived from these training corpora to produce a predictive algorithm. Machine translation models are primarily trained on parallel (bilingual) texts verified by human translators and optimized to maximize fidelity between the model’s predicted translation and the verified ground-truth translation.
LLMs, on the other hand, are trained by ingesting a massive (and largely unstructured) corpus of texts ranging from Wikipedia articles to social media posts. The broad range of unsupervised content ingested by LLMs provides them with the ability to mimic coherent language across many different topics and even different languages and tones by predicting the next word (token) in a sequence. What LLMs gain in terms of flexibility and seeming fluidity, they often lack in specificity and accuracy.
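The “predict the next token” mechanism can be illustrated with a deliberately tiny sketch. This is not how a real LLM is built (those use neural networks over enormous corpora), but a toy bigram model makes the core idea concrete: the system learns only which token tends to follow which, then emits the most probable continuation.

```python
from collections import Counter, defaultdict

# Toy illustration (not a real LLM): a bigram "model" that learns only
# which word tends to follow which in its training text.
corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(token):
    """Return the continuation seen most often in training, if any."""
    candidates = follows[token]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # "cat" — it followed "the" twice in the corpus
```

Nothing in this procedure involves the *meaning* of “cat”; the prediction is purely a statistical echo of the training data, which is the crux of the “stochastic parrot” critique discussed below.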
This difference in training and overall objective leads to disparities in performance on translation tasks. Typically, machine translation tools outperform LLMs in delivering translations with higher quality, consistency, and fidelity to the original text.2
Of course, the seeming fluidity with which products like ChatGPT respond to complex prompts is impressive, imbuing these technologies with an almost intimidating allure—especially if the last time you interacted with an “automated assistant” involved calming your breath as you repeat “Cancel subscription” to an incapable voice recognition service or clicking through increasingly irrelevant options in a chat-like interface.
However, despite the fact that LLMs exhibit even less fidelity than purpose-made translation tools like Google Translate or DeepL, their framing by evangelists and product marketers as general “do-it-all” tools and ability to produce believable (if not factually accurate) outputs have significantly muddied the waters and confused consumers.
Less-Represented Languages & Knowledge Gaps
The purpose-fit training design and quality control protocols of most machine translation (MT) tools like Google Translate provide some level of confidence in the accuracy of their outputs. By concentrating on specific language pairs and requiring ample bilingual corpora for training, these platforms set a threshold for performance, inherently providing a safeguard against the release of under-performing translation models. This offers a buffer against the propagation of low-quality translations—a thin layer of protection that is inherently absent in more generalist tools.
The lack of such strict protocols in LLMs leads to the production of outputs across languages, regardless of the volume or quality of available training data. This “attempt anyway” approach can be especially problematic for languages that are underrepresented online, which is most languages.3
While the capacity of LLMs to handle multiple languages—even those for which they have minimal training data—is often touted as a strength, it can also be a critical vulnerability. Lacking comprehensive and high-fidelity training data, by default LLMs will generate content that may carry a veneer of legitimacy while being fundamentally flawed or inaccurate.
Failure to Fail
The combination of seeming sophistication and the absence of hard-coded checks or ‘guardrails’ leads LLMs to “fail to fail”: these models are trained to generate the next likely word (token), and they will do so even when their training data is insufficient to identify significant features enabling a “meaningful” prediction. In AI parlance, this phenomenon is referred to as “hallucination”, although many argue that this anthropomorphic term only further serves to cloud the general public’s understanding of the techniques underlying LLMs.
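The structural reason for this “failure to fail” can be sketched in a few lines. In a simplified view of LLM decoding, raw scores (logits) are pushed through a softmax, which *always* yields a valid probability distribution over the vocabulary; sampling from it always returns *some* token. The vocabulary and logit values below are invented for illustration: the point is that even when the scores are nearly flat (i.e., the model has no strong evidence for any continuation), a token still comes out, because there is no built-in “I don’t know” outcome.

```python
import math
import random

random.seed(0)

# Hypothetical vocabulary; the logits are deliberately near-uniform to
# stand in for a prompt the model has almost no training data for.
vocab = ["doğru", "yanlış", "malentendirme", "<unk>"]
uninformative_logits = [0.01, 0.0, 0.02, 0.0]

def softmax(logits):
    """Turn arbitrary scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(uninformative_logits)

# Sampling never refuses: a fluent-looking token is emitted either way.
token = random.choices(vocab, weights=probs)[0]
print(token)
```

Real systems layer additional safeguards on top of decoding, but the default behavior is exactly this: confident-looking output regardless of how uninformative the underlying distribution is.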
To provide a real-world example, I once asked ChatGPT to explain the Turkish phrase “bir şeyden ibaret olmak”, meaning “to consist of X” or “to merely be X”. While the explanation provided was correct, there were some odd mistakes in one of the example sentences it offered: “Bu olay sadece bir malentendirme ibaret.” (English: This matter is merely a ‘malentendirme’.)
The word “malentendirme” doesn’t exist in Turkish. It’s a made-up word seemingly generated by applying Turkish morphology to a Latin or Spanish root: “malentend(er)” plus the Turkish nominalizing suffix “-me”, used to turn a verb into a noun (similar to the English “-ing” gerund).4 Moreover, the sentence is not grammatically correct, since the phrase requires its object to be in the ablative case, marked by the suffix “-den/-dan/-ten/-tan”.
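The ablative suffix’s four forms follow regular rules, which a short sketch can make explicit (simplified: it ignores loanword exceptions and assumes the noun ends in a plain consonant or vowel). The vowel is chosen by front/back vowel harmony with the noun’s last vowel, and the initial consonant devoices to “t” after a voiceless final consonant.

```python
# Simplified sketch of Turkish ablative allomorphy (-den/-dan/-ten/-tan).
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")
VOICELESS_FINALS = set("pçtkfhsş")  # voiceless consonants trigger -t

def ablative(noun):
    """Attach the ablative suffix, picking the right allomorph."""
    last_vowel = next(c for c in reversed(noun)
                      if c in FRONT_VOWELS | BACK_VOWELS)
    vowel = "e" if last_vowel in FRONT_VOWELS else "a"       # vowel harmony
    consonant = "t" if noun[-1] in VOICELESS_FINALS else "d"  # assimilation
    return noun + consonant + vowel + "n"

print(ablative("şey"))    # "şeyden" — as in "bir şeyden ibaret olmak"
print(ablative("kitap"))  # "kitaptan" — voiceless final p selects -tan
```

A correct version of ChatGPT’s example sentence would thus need its object in a form like “şeyden”, which the generated sentence lacked entirely.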
In other words, LLMs are prone to producing outputs that seem plausible on the surface but which fall apart under expert scrutiny. This is one of the core critiques of LLMs made in the now-seminal article “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜”:5 LLMs frequently excel at natural language processing, but are not capable of natural language understanding.
Contrary to how it may seem when we observe its output, an LM is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.
This mode of failure exhibits a few worrying characteristics. “Failure to fail” means that incorrect information is embedded in seemingly coherent and otherwise correct content, making the detection of such errors increasingly difficult.
… MT [Machine Translation] systems can (and frequently do) produce output that is inaccurate yet both fluent and (again, seemingly) coherent in its own right to a consumer who either doesn’t see the source text or cannot understand the source text on their own. When such consumers therefore mistake the meaning attributed to the MT output as the actual communicative intent of the original text’s author, real-world harm can ensue.
There are, unfortunately, already innumerable stories highlighting the ways that errors in machine-assisted translations (and more generally, the improper application of algorithms trained on and reproducing bias) have put individuals in danger and even cost lives.
The risk of incorrect translation outputs has always existed, whether originating from human or machine error (I have personally produced more than a few of them).
However, the “convincing” nature of LLMs’ outputs, combined with marketing-driven smoke-and-mirrors hype as well as a genuine lack of understanding of these newly emerging technologies, has led to misapprehensions about their capabilities and, worryingly, to increasingly “silent” modes of failure.
This type of “silent failure”, paired with overconfidence in the fidelity of LLM-based language tools, is particularly dangerous because the motivation behind adopting such tools (or hiring a translator or interpreter) is almost always precisely the acknowledgement that one or one’s organization lacks the critical capabilities and capacity to produce and proofread content in the target language(s).
Thus, without supervisory capacity, it falls on the service recipient of the translated communication to sound the alarm that all is not right. This is especially troublesome as the power dynamic in the deployment of language services, particularly in public and humanitarian sectors, is often asymmetrical. Those receiving services, despite being in positions of less authority, often uniquely possess the experiential knowledge to detect failures in translation. However, this ability to discern does not necessarily equate to the practical power to correct or challenge such inadequacies. The consequence is a paradox where the most vulnerable, while being the first to recognize a service’s shortcomings, are also the ones who bear the brunt of its flaws, many times without a clear path to redress.
I think it’s important to remind oneself that AI-powered tools are just that: tools, which require both an understanding of their underlying mechanics and appreciation for the knowledge needed to wield them effectively. In applications involving AI-powered language tools, it’s particularly important that these products be adopted as utilities employed in the service of “machine-assisted translation”, rather than being perceived and deployed as a reliable input-output device. Doing so means hiring teams with the requisite linguistic and cultural knowledge to effectively and safely supervise and use such tools.
Beyond the obvious risks that inaccurate or improper translations pose, I would also encourage organizations to consider deeper “humanistic” aspects of their commitment to linguistic (and cultural) diversity and accessibility. A meaningful commitment to fostering accessibility and inclusion requires more engagement than simply translating written communications into other languages!6
Footnotes & Acknowledgements
If you found these topics interesting, I highly recommend following the reporting being done by the folks at “Rest of World”, in particular their collection of stories focusing on “Language, translation, and the digital age”. Many of these articles provided me with a point of entry for learning more about the topic.
“Artificial Intelligence” is an umbrella term covering a number of computational approaches to solving complex problems. This umbrella includes sub-fields like Machine Learning (ML) and Natural Language Processing (NLP). ↩
For a detailed exploration, I can’t recommend enough Russell Brandom’s article “What Languages Dominate the Internet?,” Rest of World, June 7, 2023, https://restofworld.org/2023/internet-most-used-languages/. ↩
At first sight, I was tempted to believe this could be a Turkified loan from Ladino, though that is not the case. ↩
It’s worth noting that other critical arguments are presented in this paper which I did not take into account here, including questions of ecological sustainability in the age of exponential LLM growth. Emily M. Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21 (New York, NY, USA: Association for Computing Machinery, 2021), 610–23, https://doi.org/10.1145/3442188.3445922. ↩
I can’t help but find many attempts at linguistic inclusion to be woefully misguided and, frankly, somewhat condescending. Take for example NYC Mayor Eric Adams’s recent use of ML-powered voice models to “deepfake” his voice speaking to New Yorkers in a number of different languages which he does not actually speak: Katie Honan, “Tongue Twisted: Adams Taps AI to Make City Robocalls in Languages He Doesn’t Speak,” THE CITY - NYC News, October 16, 2023, http://www.thecity.nyc/2023/10/16/adams-taps-ai-robocalls-languages-he-doesnt-speak/. ↩