Generative AI in its broadest sense refers to AI systems that create new content, predominantly text but also images, audio and video, based on users' natural language prompts. Before 2022, the most impressive AIs (at least as far as widespread public awareness is concerned) were narrow in both scope and value. Most research or work-related examples were in the domains of prediction, classification and basic natural language processing, such as sentiment analysis or chatbots informed by a specific, curated knowledge base. Narrow but extremely effective AI 'recommendation engines' proliferated in the corporate and social media spheres. The idea of a more generalised artificial intelligence was still mostly theoretical until a major advancement that combined neural networks (in particular the transformer architecture) with colossal investment in computational hardware and data. This led to the development of GPT (Generative Pre-trained Transformer), with OpenAI's GPT-3 powering the first chatbot capable of convincingly human-like textual interactions, including entirely novel language outputs.
Large Language Models
By far the most prominent form of generative AI now, and likely for the foreseeable future given the dominance of language in most human interactions that we would associate with any kind of intelligence (and certainly in the majority of scholarly research), is the Large Language Model (LLM). The basic idea behind LLMs is fairly simple to grasp: for any given text, what is the likely text that would follow it, given the word associations in the training dataset? The toy example below illustrates the basic concept (though in an extremely simplified way).
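To make the idea concrete, here is a minimal toy sketch in Python. It 'predicts' the next word purely from co-occurrence counts in a tiny invented corpus; real LLMs learn these statistics with neural networks over subword tokens and billions of parameters, but the underlying objective of estimating what plausibly comes next is the same.

```python
# Toy next-word predictor built from raw co-occurrence counts.
# The corpus and everything else here is an illustrative stand-in:
# real LLMs learn these statistics with neural networks, not tables.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count which word follows each word in the training text.
successors = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    successors[current_word][next_word] += 1

def predict_next(word):
    """Estimate P(next word | current word) from the counts."""
    counts = successors[word]
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

print(predict_next("sat"))  # {'on': 1.0} -- 'sat' is always followed by 'on'
print(predict_next("the"))  # 'cat', 'mat', 'dog', 'rug' at 0.25 each
```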
GPT-4 was trained on almost all the public text on the internet (speculated to be in the trillions of words), with the primary objective of predicting the next sequence of words as well as it can. To achieve effective prediction, it has to somehow internalise a representation of the human world, which confers an advanced – though still limited and in many ways alien – ability to understand, emerging mostly from the correlations in the text it has been trained on. In a sense, the text itself can be considered a limited projection of the world and, by implication, of humanity. A fascinating report published by Anthropic in 2024, entitled "Scaling Monosemanticity", offers promising early insights into how LLMs create their own internal abstractions which 'fire' in relevant sections of text.

While these insights represent exciting early steps that are critically important for long-term human-AI value alignment, advanced LLMs are still much too complex to provide the kind of explainability or audit trail many AI regulations have required. The fundamentally stochastic and unpredictable nature of LLM outputs also means strict research reproducibility is difficult to achieve beyond very simple tasks, such as classifications with clear rules and boundaries.
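The stochasticity is easy to see in miniature: a model's final step produces a probability distribution over possible next tokens, from which one is sampled, with a 'temperature' parameter reshaping the distribution. A minimal sketch, using invented toy logits rather than real model output:

```python
# Minimal sketch of temperature sampling, the source of the stochasticity
# described above. The logits are invented toy values, not real model output.
import math
import random

logits = {"Paris": 5.0, "London": 3.5, "Berlin": 3.0}  # hypothetical next-token scores

def sample(logits, temperature=1.0):
    # Softmax with temperature: lower values sharpen the distribution
    # (near-deterministic), higher values flatten it (more varied output).
    scaled = {tok: math.exp(score / temperature) for tok, score in logits.items()}
    total = sum(scaled.values())
    weights = [value / total for value in scaled.values()]
    return random.choices(list(scaled), weights=weights)[0]

# The same 'prompt' can yield different continuations from run to run:
print([sample(logits, temperature=1.0) for _ in range(5)])
print([sample(logits, temperature=0.1) for _ in range(5)])  # near-deterministic
```

Even at low temperatures the process is only near-deterministic, which is one reason two runs of the same prompt can diverge, and why reproducibility claims need care.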
A useful way to think about LLMs is to treat them as a well-read, confident and eager-to-please 'intern' on their first day on the job, who often misunderstands and gets things wrong. They are not deterministic machines. What they excel at is producing simulated, plausible-looking text, which is a remarkable technological breakthrough in itself, but it doesn't constitute a valid or reliable source of knowledge about the world. Plausible-looking outputs are often correct, and with the new chain-of-thought paradigm (see next section) models have become better and better at being correct more consistently, so it's forgivable if people accept their outputs at face value. But LLMs can and do produce incongruous so-called 'hallucinations', where they confidently output entirely incorrect assertions.
'Reasoning'
In September 2024, a new kind of generative AI model was announced by OpenAI: o1. This was followed in subsequent months by Google Gemini's 'thinking' models, DeepSeek R1, Grok 3 and Claude Sonnet 3.7, all of which mirror the o1 approach of generating chain-of-thought 'reasoning' steps out of the box before answering (better thought of as pattern matching towards sequences of steps that are more likely to lead to correct answers, based on the training data). This has opened up new opportunities for using LLMs on more challenging tasks that require correct answers or some known measure of quality, versus simply creative token prediction based on the training data. Here are the latest LLM rankings from LiveBench.ai, with the top 7 places all taken by these new 'reasoner' models:
[Figure: LiveBench.ai LLM rankings, with 'reasoner' models occupying the top 7 places]
Since early 2025, these models have demonstrated more reliable step-by-step capabilities. Unlike earlier LLMs that required prompt 'tricks' such as asking the model to explain every step or pretend to be an expert, these new models automatically engage in chain-of-thought reasoning. The upshot for researchers is less manual prompt engineering and a greater likelihood of correct, well-reasoned outputs, provided sufficient context is given. This shift expands the range of tasks generative AI can usefully perform towards more thoughtful, scientific and empirical work, and it changes the approach to prompting: the emphasis falls on careful instructions, real data and detailed context rather than clever manipulations.
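As a rough illustration of the shift, compare the two prompt styles below. The wording, data and variable names are invented for the example and are not tied to any vendor's API:

```python
# Contrast between the older 'trick' style of prompting and the
# instruction-plus-context style suited to reasoning models. All prompt
# text and variable names here are invented for illustration.

# Older style: coax step-by-step behaviour out of a base LLM with role-play.
legacy_prompt = (
    "You are a world-class expert statistician. Think step by step.\n"
    "Is this survey's sampling method sound?"
)

# Newer style: the model reasons by default, so the effort goes into precise
# instructions, real data and explicit success criteria instead.
survey_excerpt = "We recruited 212 respondents via a university mailing list..."
reasoning_prompt = (
    "Task: assess the sampling methodology in the excerpt below.\n"
    "Criteria: selection bias, sample size adequacy, generalisability.\n"
    "Output: one short paragraph per criterion, then an overall verdict.\n\n"
    f"Excerpt:\n{survey_excerpt}"
)

print(reasoning_prompt)
```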
'Deep Research'
In February 2025, OpenAI released Deep Research, which uses the full o3 model and a longer, more rigorous process of web search to produce good-quality syntheses and reports. It's very limited in the sources it can access: notably, it cannot access paywalled scholarly content. For initial intel gathering from easily accessible public websites, however, it is very useful and far superior to previous web search integrations with AI. The number of errors is far smaller, the breadth of sources far greater, and the analysis and synthesis run at a far greater depth than any previous generative AI model has been capable of. It often takes 10 minutes or longer and can produce cited reports of 10,000 words or more. A key deficit is that its 'saturation point' for reaching a conclusion comes too early (likely to save costs); it rarely considers searching for newer or contrasting sources once it has found something 'good enough'. Nonetheless, this was the first instance of an AI tool showing real capability for breadth and depth of web search; expanding its access to scholarly publications and allowing it to continue for much longer would likely constitute a significant breakthrough for accelerating literature search and empirical information gathering.
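To illustrate what an early 'saturation point' means in practice, here is a deliberately simplified, hypothetical sketch of a research loop with a stopping rule. Nothing here reflects OpenAI's actual implementation; the tiny in-memory 'web' stands in for real search:

```python
# Hypothetical sketch of a 'deep research'-style loop, to illustrate the
# 'saturation point' critique above. The tiny in-memory 'web' and the
# stopping rule are invented; nothing here reflects OpenAI's actual design.

FAKE_WEB = [  # stand-in for the open web: (url, claims found on that page)
    ("site-a", {"claim-1", "claim-2"}),
    ("site-b", {"claim-2", "claim-3"}),
    ("site-c", {"claim-3"}),
    ("site-d", {"claim-4"}),  # a contrasting source the loop may never reach
]

def web_search(visited):
    """Return the next unvisited page, if any (stand-in for real search)."""
    return [page for page in FAKE_WEB if page[0] not in visited][:1]

def deep_research(max_rounds=10, saturation_threshold=0.4):
    visited, claims = set(), set()
    for _ in range(max_rounds):
        results = web_search(visited)
        if not results:
            break
        url, page_claims = results[0]
        visited.add(url)
        new_claims = page_claims - claims
        claims |= page_claims
        # Stop once a round adds proportionally little new information.
        # Set this threshold too impatiently and the loop halts at 'good
        # enough', never reaching newer or contrasting sources like site-d.
        if len(new_claims) / len(claims) < saturation_threshold:
            break
    return visited, claims

print(deep_research())  # stops after site-a and site-b; site-d is never seen
```

With the threshold set as above, the loop halts after two pages and never reaches the contrasting source, which is precisely the 'good enough' behaviour described in the paragraph above.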
Potential future directions
The major AI players have recognised the inherent deficits of LLMs that train on vast quantities of messy, human-produced text and simulate 'answers' from it, as well as the hugely constraining impact of maximum context length windows. While scaling up to training on trillions of words is the number one reason LLMs are as good as they are now, efforts are being made to use generative AI to generate synthetic training data, iteratively evaluated (including by more advanced reasoning models), to maximise the quality of the underlying data. The hope is that reducing contamination from organically generated human errors will bring increased reliability, accuracy and overall quality of outputs. An early paper entitled "Textbooks Are All You Need" (Gunasekar et al. 2023) showcased impressive results given the tiny corpus of training data and the small size of the model (1.3bn parameters, compared to GPT-4 which, while not official, is said to have over a trillion). 'Garbage in, garbage out' has always been true, and it is certainly the case for LLMs trained on almost the entirety of public human-produced text.
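The generate-then-filter idea behind synthetic training data can be sketched in a few lines. This is an assumption-laden toy, not any lab's published pipeline; in practice one model produces candidate examples and a stronger reasoning model grades them:

```python
# Minimal sketch of the generate-then-filter idea for synthetic training data.
# The generator and scorer are trivial stand-ins for LLM calls.
import random

random.seed(0)  # for a reproducible illustration

def generate_candidate(i):
    # Stand-in for an LLM producing a synthetic training example.
    return {"id": i, "text": f"synthetic example {i}"}

def quality_score(example):
    # Stand-in for an evaluator model grading accuracy/clarity in [0, 1].
    return random.random()

def build_synthetic_corpus(n_candidates=1000, threshold=0.8):
    corpus = []
    for i in range(n_candidates):
        candidate = generate_candidate(i)
        if quality_score(candidate) >= threshold:  # keep only high quality
            corpus.append(candidate)
    return corpus

corpus = build_synthetic_corpus()
print(f"kept {len(corpus)} of 1000 candidates")  # roughly the top-scoring 20%
```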
In December 2024, OpenAI announced impressive results for their new o3 model which, among other benchmarks, the creators of the ARC AGI Prize described as a genuine breakthrough in reasoning; they had previously stated that it would be several years before an LLM architecture could perform well on these tasks. o3's Codeforces result places it higher than 99.8% of global competitive coders, a group that is already well above average. Its GPQA score (Google-proof, PhD-level science questions, in the sense that a typical PhD in the field would score about 70%) exceeded all expectations:
[Figure: o3 benchmark results, including its GPQA score]
In January 2025, DeepSeek's R1 reasoning model went viral thanks to a much simpler and cheaper training reward approach and a smaller but higher-quality curated training dataset. Its ability to reach near-o1 level results at a fraction of the computing cost created such a shock that the stock price of Nvidia - the giant GPU manufacturer known as the modern-day AI equivalent of the shovel provider during a gold rush - dropped more than 15% in a single day.
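The published description suggests a largely rule-based reward rather than an expensive learned reward model: check the output's format and whether the final answer matches a known ground truth. A heavily simplified sketch, with illustrative tag names and weights:

```python
# Heavily simplified sketch of a rule-based reward of the kind DeepSeek
# described for R1: no learned reward model, just checks on format and on
# the final answer. The tag names and weights here are illustrative only.
import re

def reward(model_output: str, ground_truth: str) -> float:
    score = 0.0
    # Format reward: reasoning must appear inside <think> ... </think> tags.
    if re.search(r"<think>.+?</think>", model_output, re.DOTALL):
        score += 0.5
    # Accuracy reward: the final answer (outside the think block) must match.
    final_answer = re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL).strip()
    if final_answer == ground_truth.strip():
        score += 1.0
    return score

print(reward("<think>2 + 2 = 4</think>4", "4"))  # 1.5: well-formed and correct
print(reward("probably 5", "4"))                 # 0.0: no reasoning, wrong answer
```

Because such rewards can be computed cheaply and automatically at scale, they sidestep much of the cost that a learned reward model adds to reinforcement learning.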
Agentic AI
There are also early – though very limited and unreliable – manifestations of what some refer to as 'agentic systems' that can handle long-term, multi-step tasks incorporating planning, reflection, decisions, short-term memory and meaningful interaction with digital software. In late 2024, Anthropic released an early 'computer use' capability that lets its Claude models operate a computer directly (moving a cursor, clicking and typing), albeit slowly and unreliably at this stage.
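A minimal sketch of the control flow such systems aim for, with stand-in tools and planner (the real versions involve LLM calls for planning and reflection, and are far less reliable):

```python
# Minimal sketch of an agentic loop: plan, act with a tool, observe, and keep
# short-term memory until done. The planner and tools are trivial stand-ins.
def plan_next_action(memory, tools):
    # Stand-in planner: run each tool once, then stop. A real agent would
    # ask an LLM to choose a tool, reflect on failures and revise its plan.
    used = {line.split("ran ")[1].split(",")[0] for line in memory if "ran " in line}
    remaining = [name for name in tools if name not in used]
    return remaining[0] if remaining else "finish"

def agent(task, tools, max_steps=5):
    memory = [f"task: {task}"]  # short-term memory of the run so far
    for step in range(max_steps):
        action = plan_next_action(memory, tools)
        if action == "finish":
            break
        observation = tools[action]()  # act on the (stubbed) world
        memory.append(f"step {step}: ran {action}, saw {observation!r}")
    return memory

tools = {
    "search_web": lambda: "3 relevant results",
    "draft_summary": lambda: "200-word summary",
}
for entry in agent("summarise recent findings", tools):
    print(entry)
```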