Louise Matsakis covers Amazon, internet law, and online culture for WIRED.

Many of the biggest online platforms were founded in Silicon Valley, and started with primarily English-speaking user bases. As they’ve expanded around the world and to different languages, they’ve been playing catch-up. Facebook has faced criticism for not employing enough native speakers to monitor content in countries where it has millions of users. In Myanmar, for example, the company for years had only a handful of Burmese speakers as hate speech proliferated. Facebook has admitted that it did not do enough to prevent its platform from being used to incite violence in the country.

Another part of the problem is that relatively few datasets suitable for training artificial intelligence tools have been created in these languages. Take Sinhala, also known as Sinhalese, which is spoken by around 17 million people in Sri Lanka and can be written in four different ways. Facebook’s algorithms, trained primarily on English and other European languages, don’t map well to it. That makes it difficult for the social network to automatically identify things like hate speech in the country, or stop the flow of misinformation after a terrorist attack.

But Tcherneshoff says language diversity is about more than just practicality; it’s about expression. Jokes, emotions, and art are often difficult, if not impossible, to translate from one language to another. She pointed to projects like the Mother Language Meme Challenge, which invited people to make memes in their native tongue for Unesco’s International Mother Language Day in 2018. The idea, in part, was to demonstrate how humor is often intimately tied to language.

Mozilla is one organization working to crowdsource language datasets that any developer can use for free. One example is Common Voice, which it claims is “the world’s most diverse voice dataset.” It includes recordings from over 42,000 people in dominant languages like English and German, but also Welsh and Kabyle. The project is designed to give engineers the tools they need to build things like speech-to-text programs in different tongues. Mark Surman, executive director of the Mozilla Foundation, believes open source datasets like Common Voice are one of the only viable ways to ensure more language diversity in emerging tech. At for-profit companies, the issue “falls very low on the economic ladder,” he said during the RightsCon panel.

Bringing more languages online may ultimately be an exercise in cultural preservation, rather than utility. Despite advocates’ best efforts, it’s unlikely there will ever be as many websites in Yoruba, say, as there are in French or Arabic. New internet users may simply opt to browse in their second or third language instead of their native tongue.

At the same time, corporations like Google have built programs that make it easier to access online content in different languages, like Google Translate. Google also gave some of its tools to Wikipedia to help translate articles, although they still require careful review by native speakers; Wiki editors have complained that the Google tools sometimes produce shoddy results. For the time being, promoting language diversity online still requires the concerted effort of humans.