The problem nobody warns you about
A native speaker is not automatically a correct speaker. A German narrator hired for their warm voice and clear diction has spent their professional life in German. They may have never encountered the names of Flemish masters, Spanish sculptors, or Japanese printmakers. A French speaker recording for a British museum will handle "Monet" without a second thought, but may stumble badly over Turner, Constable, or Hepworth. And place names compound the problem: the same guide that references Zurbarán might also mention Seville, Extremadura, and a half-dozen Spanish village churches — each with its own phonetic traps for a non-native speaker.
The difficulty scales with cultural specificity. An audio guide for a modern art museum with an international collection is a minefield of foreign proper names. An audio guide for a regional historical museum is quieter, but rarely problem-free: local place names, regional dialects, and historically specific pronunciations can be just as tricky.
Most production workflows don't account for this at all. The script goes to the speaker, the speaker records it, and the mispronunciations surface during quality control — or worse, are noticed by a native-speaking visitor standing in front of the exhibit.
Why the standard solutions don't work
The two most common fallback approaches each have a serious flaw.
The International Phonetic Alphabet (IPA) is theoretically universal and phonetically precise — but essentially useless in practice unless you're working with a trained linguist. Most professional voice artists have never used IPA transcription in a recording session. Asking a narrator to decode /ˌzʊərbəˈrɑːn/ before pressing record introduces friction, uncertainty, and sometimes outright panic into a process that should be smooth. We have worked with IPA in our AI voice production workflow as well, and found the results inconsistent even there: the technology supports it in principle, but achieving a correct output still requires patience and specialized knowledge that most production teams don't have.
The other common approach — leaving the pronunciation to the speaker's own research — fails for a different reason. It assumes the speaker will invest time in finding the correct pronunciation, will know where to look, and will recognize a wrong result when they hear it. In practice, speakers under time pressure approximate. If they mispronounce a name and no one gave them guidance, there is no legitimate basis for requesting a re-recording.
What we do instead
At Nubart GUIDE, once a script has been finalized and approved, our production team goes through it before it ever reaches a speaker. Every proper name (artist, architect, historical figure, place) is flagged for potential pronunciation difficulty, relative to the target speaker's native language. This is a key distinction: the same word may need flagging in one language version and not in another. "Wilhelmshöhe" stays unmarked in the German script. It gets flagged in the French, English, and Japanese versions. "Zurbarán" is unremarkable to a Spanish speaker and a serious problem for almost everyone else.
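The per-language rule can be illustrated with a few lines of Python. This is a rough sketch rather than a description of our actual tooling: the `NAME_ORIGINS` table, the two-letter language codes, and the `flag_names` function are all assumptions made for the example. It also simplifies by treating a name as safe only when its origin language matches the script language.

```python
# Hypothetical lookup table: proper name -> language it originates from.
NAME_ORIGINS = {
    "Wilhelmshöhe": "de",
    "Zurbarán": "es",
    "Hepworth": "en",
}


def flag_names(names, script_language):
    """Return the names that need a pronunciation reference for this script.

    Simplifying assumption: a name is safe only when its origin language
    matches the script language. Names missing from the table are flagged.
    """
    return [
        name for name in names
        if NAME_ORIGINS.get(name) != script_language
    ]


names_in_script = ["Wilhelmshöhe", "Zurbarán", "Hepworth"]
german_flags = flag_names(names_in_script, "de")  # Wilhelmshöhe stays unmarked
french_flags = flag_names(names_in_script, "fr")  # all three are flagged
```

In a real production the decision is a human judgment call. A lookup like this could at most pre-sort the list for the person reviewing the script.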
For each flagged word, we produce a short audio reference: a recording of the word spoken clearly, first at normal speed, then with stress placed on each syllable in turn. The files follow a simple but important convention:
- One MP3 per word, named after the word itself: Zurbarán.mp3, Hepworth.mp3, Eyck.mp3
- Stored alphabetically in a shared folder alongside the script
- Flagged in blue in the Word document, so the speaker spots them instantly while reading
The result is that the speaker sees a blue-flagged term mid-script, opens the shared folder, and finds the corresponding file in seconds — no scrolling through a long reference recording, no hunting through footnotes. The lookup is fast enough to happen naturally in the flow of a recording session.
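Because the convention is purely name-based, it is easy to check before a script goes out. The following Python sketch, which is an illustration and not our production tooling, lists flagged terms that have no matching MP3 in the folder. The function names and the idea of passing the flagged terms in as a plain list are assumptions for the example. The Unicode normalization step is there because some systems store accented file names in decomposed form, so "Zurbarán" typed in a script may not compare equal to the file name without it.

```python
import unicodedata
from pathlib import Path


def nfc(text):
    """Normalize to NFC so accented names compare equal however they were stored."""
    return unicodedata.normalize("NFC", text)


def missing_references(flagged_terms, folder):
    """Return the flagged terms that have no '<term>.mp3' file in the folder."""
    available = {
        nfc(path.stem)
        for path in Path(folder).iterdir()
        if path.suffix.lower() == ".mp3"
    }
    return sorted(term for term in flagged_terms if nfc(term) not in available)
```

Calling `missing_references(["Zurbarán", "Hepworth", "Eyck"], "references")` on a hypothetical `references` folder returns whichever of the three names still lacks a recording.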
We send the script as a Word document rather than a PDF. This lets each narrator adjust font size, line spacing, and layout to their own working preferences — something that matters more than it might seem. Voice artists have well-established recording routines, and a script they can't adapt to their setup creates unnecessary friction before a single word is recorded.
These reference files are made with native speakers where the pronunciation is standard, and by the museum's own team where regional or institutional conventions apply. That last point matters: a place name can have an officially sanctioned pronunciation that differs from how locals actually say it. For an audio guide, local convention is usually what the visitor expects to hear.
We arrived at this approach through trial and error. We originally recorded all flagged words into a single audio file in script order — useful in theory, clunky in practice, since the speaker had to skip through the recording to find a specific word mid-session. We also tried embedding individual file links directly into the script document, which was technically possible but too labor-intensive to maintain. Individual named files in a shared folder turned out to be the most reliable solution.
What this looks like in quality control
After recording, pronunciation is an explicit checkpoint in our review process. We don't expect a German speaker to produce a phonetically perfect Spanish /r/ or a French speaker to master the Welsh "ll" (/ɬ/). The standard we commit to, and that our Terms and Conditions reflect, is pronunciation that is substantially close to correct: recognizable and not jarring to a native-speaking listener. That is an achievable bar, and one that proper briefing makes reliably reachable.
In practice, when a reference file was prepared and delivered, we have clear grounds to request a re-recording if the result falls short. When it wasn't — as happens with productions we haven't managed from the start — that leverage disappears. This is one reason we recommend involving a production partner early, before the script reaches the recording stage.
A word about Forvo and online resources
For less common words, we sometimes consult Forvo, a crowdsourced pronunciation database with recordings in hundreds of languages. It's a useful starting point, but not a reliable final source: the quality of individual recordings varies considerably, and some are simply wrong. We treat Forvo as a lead to verify, not a verdict. When in doubt, a brief recording from a native speaker on the team — or from the museum's own staff — is always preferable.
What about AI voices?
AI voice generators have made certain aspects of multilingual audio guide production faster and more accessible. Pronunciation of foreign proper names is not one of them. In our experience producing AI narration at Nubart GUIDE's entry service level, artist names and foreign toponyms remain among the most persistent failure points, and unlike a human speaker, an AI voice cannot be given a reference recording to learn from. If pronunciation accuracy across multiple languages and cultural contexts matters to your project, this is one of several factors to weigh carefully when choosing between AI and human narration. We've written about this tradeoff in more detail in our assessment of AI voices for museum audio guides.
What museums can do on their end
The most useful thing a museum can provide before production starts is a list of proper names that might be unfamiliar to foreign speakers — artist names above all, but also place names, historical figures, and any collection-specific terminology. Three things are particularly helpful:
- A proper name inventory: every artist, architect, and location name in the script that a non-native speaker might mispronounce. You don't need phonetics, just the list.
- A language check: if a name has an accepted conventional form in other languages (the kind you find by checking the language menu on Wikipedia), flag it. "Firenze" and "Florence" are an easy example; many museum-specific names are less obvious.
- A voice reference: an informal smartphone recording of a curator or staff member saying the tricky names out loud. Two seconds of audio is more useful than a page of written guidance.
The museums that provide this information at the start of a project end up with better audio guides. It's a small investment that prevents a disproportionately large problem.
