A query to MT specialists regarding linguistic concerns …
What I see here is that the big obstacle in Machine Translation is not multiple meanings of the same spelling in the source, but term meaning across different dialects of the source (example: enUS vs. enGB)! It is fine to claim that post-editing will handle this; but because a term can carry multiple meanings, or different dialects use different words for the same meaning, or the source mixes dialects, the source text can be tricky and ambiguous. Even for a domain expert, when no surrounding context is available, it could be impossible to arrive at the correct understanding during post-editing.
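The dialect problem above can be sketched as a toy lookup. To be clear, the term list, meanings, and function below are my own hypothetical illustration, not part of any real MT system or toolkit:

```python
# Toy illustration: when the source dialect is unknown, a term
# cannot be resolved to a single meaning before translation.
# The term list below is a hypothetical example.
TERMS = {
    # term: {dialect: meaning}
    "pants": {"enUS": "trousers (outerwear)", "enGB": "underwear"},
    "biscuit": {"enUS": "soft bread roll", "enGB": "sweet baked cookie"},
}

def resolve(term, dialect=None):
    """Return the meaning of `term` for a known dialect, or the full
    list of candidate meanings when the dialect is unknown."""
    senses = TERMS.get(term.lower())
    if senses is None:
        return None
    if dialect in senses:
        return senses[dialect]
    # Dialect unknown or mixed-dialect source: every sense survives,
    # and post-editing has no context to disambiguate with.
    return sorted(senses.values())

print(resolve("pants", "enGB"))  # one meaning: underwear
print(resolve("pants"))          # ambiguous: both candidate meanings
```

The point of the sketch is only that without a dialect tag on the source, the system (and the post-editor) is left with a set of candidates rather than an answer.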
From the many articles I have been reading for almost the last 7-8 years, not even one has discussed this. Today there are a number of approaches in MT, and each is evolving over time. Rule-based Machine Translation, Transfer-based Machine Translation, Statistical Machine Translation, Example-based Machine Translation, Phrase-based Machine Translation, Hybrid Machine Translation, and Interlingual Machine Translation are some of today's main known methodologies/techniques.
I went through each of the methods mentioned above, and I found no discussion of preparation or processes on the language side, other than creating a corpus, an algorithmic term list, or a grammatically structured TermBase. Is my impression true that no MT designer thinks about the existence of source dialects (let's set aside the 100+ official dialects of English; you have to accept at least two: enUS and enGB!) or about domain-specific structural differences in sentence construction (Marketing/Legal/Core Medical/Core Scientific) and term usage?