Although current large language models are complex, the basic specification of the underlying language generation problem is simple to state: given a finite set of training samples from an unknown language, produce valid new strings from that language that do not already appear in the training data. The question considered in this work is what can be concluded about language generation from this specification alone, without further assumptions. Concretely, suppose an adversary enumerates the strings of an unknown target language L, which is known only to come from a countably infinite list of candidate languages. A computational agent is said to generate from L in the limit if, after some finite point in the enumeration, every string it produces is a new element of L that the adversary has not yet presented. The main result established by the authors is that such an agent exists for every countable list of candidate languages. This stands in sharp contrast to the negative results of Gold and Angluin in the classical model of language learning by identification, where the goal is to recover the exact target language from samples. The contrast shows that identifying a language and generating from it are fundamentally different problems.
Link to the paper: https://proceedings.neurips.cc/paper_files/paper/2024/file/7988e9b3876ad689e921ce05d711442f-Paper-Conference.pdf
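The setup above can be illustrated with a small simulation. The sketch below covers only the warm-up case of a finite candidate list, not the paper's full algorithm for countably infinite lists: the agent generates from the intersection of all candidates still consistent with the enumerated sample, and since the true language is always among the consistent candidates, that intersection is a subset of the true language. The specific candidate languages and the adversary's enumeration order are illustrative assumptions, not taken from the paper.

```python
def make_agent(candidates):
    """Build an agent for a finite list of candidate languages, each given
    as a membership predicate over the natural numbers."""
    def agent(sample, search_bound=10_000):
        seen = set(sample)
        # A candidate is consistent if it contains every enumerated string.
        consistent = [c for c in candidates if all(c(s) for s in seen)]
        # Generate from the intersection of all consistent candidates:
        # the true language is always consistent, so every element of the
        # intersection is guaranteed to lie in the true language.
        for x in range(search_bound):
            if x not in seen and all(c(x) for c in consistent):
                return x
        return None  # no unseen element of the intersection within the bound
    return agent

# Toy candidate list: multiples of 2, multiples of 3, multiples of 6,
# and all naturals.
candidates = [
    lambda x: x % 2 == 0,
    lambda x: x % 3 == 0,
    lambda x: x % 6 == 0,
    lambda x: True,
]
agent = make_agent(candidates)

# Adversary enumerates the target language L = multiples of 6.
sample = []
for s in [0, 6, 12, 18, 24]:
    sample.append(s)
    guess = agent(sample)
    # Each output is a new element of L not yet enumerated by the adversary.
    assert guess % 6 == 0 and guess not in sample
```

In this toy run every candidate contains all multiples of 6, so all four stay consistent forever and their intersection is exactly L; the agent therefore emits fresh multiples of 6 from the very first round. For infinite candidate lists this naive intersection argument breaks down, and the paper's construction instead works along a chain of consistent candidates.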