- cross-posted to:
- magick@group.lt
- hackernews@lemmy.smeargle.fans
There is a discussion on Hacker News, but feel free to comment here as well.
Regarding the Portuguese translation (here’s a direct link to the relevant text), the substack author claims “I would say that this is at the level of a human expert”; frankly, that is not what I’m seeing.
(NB: I’m a Portuguese native speaker, and I’m somewhat used to 1500–1700 Portuguese texts, although not medical ones. So take my translation with a grain of salt, and refer to what I said in the comment above: test this with two languages that you speak; I don’t expect anyone here to just “trust me”.)
I’ll translate a bit of that by hand here, up to “nem huma agulha em as mãos.”.
I’m being somewhat literal here, and trying to avoid making shit up. And here’s what GPT-4 made up:
Interesting pattern on the hallucinations: GPT-4 is picking words in the same semantic field. That’s close but no cigar, as they imply things that are simply not said in the original.
Wow, that’s really interesting!
I don’t know much about how GPT-4 is trained, but I’m assuming the model was trained on Portuguese-to-English translations somehow…
Anyway, considering that a) I don’t know which variety of Portuguese GPT-4 was trained on (it could be Brazilian, as that’s more widely available) and b) the text is in old (European) Portuguese and written in archaic calligraphy, unless the model is specifically trained for it, we only get close-enough approximations that hopefully don’t fully change the meaning of the texts, which it seems like they do :(
I might be wrong, but from what I’ve noticed*, LLMs handle translation without relying on external tools.
The text in question was printed, not calligraphy, and it’s fairly recent (the substack author mentions it’s from the 18th century). It was likely handled through OCR; the typeface is quite similar to a modern italic one, with some caveats (the long ʃ, the italic ampersand, the odd shape of the tilde). I don’t know if GPT-4 handles this natively, but note that the shape of most letters is by no means archaic.
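To make the typographic caveats concrete, here’s a toy sketch of the kind of character normalization an OCR post-processing step might do before the text ever reaches a translation model. The mappings below are my own assumptions based on the quirks listed above (long ʃ, italic ampersand), not anything the substack author or OpenAI actually does:

```python
# Hypothetical pre-processing sketch: map 18th-century typographic quirks
# to modern letters. These mappings are assumptions for illustration only.
TYPOGRAPHIC_MAP = {
    "ʃ": "s",  # long s, used word-internally in older typefaces
    "ſ": "s",  # the Unicode "long s" codepoint, same idea
    "&": "e",  # the italic ampersand abbreviates Latin "et" (Portuguese "e")
}

def normalize_typography(text: str) -> str:
    """Replace archaic glyphs with their modern equivalents."""
    for old, new in TYPOGRAPHIC_MAP.items():
        text = text.replace(old, new)
    return text

print(normalize_typography("obʃervaçam & nota"))  # → observaçam e nota
```

Of course, a model trained end to end on enough old printed text wouldn’t need this step at all; the point is just that the glyph-level gap to modern type is small.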
In this specific case it doesn’t matter much whether it was trained on text following ABL (“Brazilian”) or ACL (“European”) standards, since the text precedes both anyway, and the spellings of the two modern standards are considerably more similar to each other than to what was used back then (see: observaçam→observação, huma→uma, he→é). What might be relevant, however, is the register the model was trained on, given that formal written Portuguese is highly conservative; although to be honest I have no idea.
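Just to illustrate the scale of the spelling gap, here’s a toy word-level lookup built from the three correspondences I gave above. This is obviously not how an LLM handles archaic spelling (it has no such table), it’s only to show what a naive modernization of those examples looks like:

```python
# Archaic→modern spelling correspondences from the examples above.
# A word-level lookup like this is a toy illustration, nothing more.
ARCHAIC_TO_MODERN = {
    "observaçam": "observação",
    "huma": "uma",
    "he": "é",
}

def modernize(word: str) -> str:
    """Return the modern spelling if known, otherwise the word unchanged."""
    return ARCHAIC_TO_MODERN.get(word.lower(), word)

print(" ".join(modernize(w) for w in "he huma observaçam".split()))
# → é uma observação
```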
*note: this is based on a really informal test I did with Bard, inputting a few prompts in Venetian. To my surprise, it was actually able to parse them, even though most translation tools don’t support the language.