Generative AI

Cambridge Professor Neil Lawrence joined Yale Professor Nicholas Barberis and Duke Professor Campbell R. Harvey, along with practitioners from across Man Group, to discuss Generative AI. The discussion was moderated by Man Group’s Otto Van Hemert.

Executive summary

Part I: Generative AI: past, present, future

How did we get to where we are today? AI is often misinterpreted as machines that solve problems from first principles. On the contrary, the AI of today follows an empirical, data-driven approach: can we look at data in the real world, and reconstruct it in some “intelligent” way? This concept has been employed on a small scale for many years, including in finance, but its full potential only started to be realised once models started being trained on close to everything that everyone has ever produced, at scale. At these huge data scales, through the process of condensing that data, these language models are able to “reconstruct” human reasoning.

Evolution or revolution? There is a view among some economists that no single technology, on its own, has ever been revolutionary – real economic impact always requires a combination of factors. Even if the technology does end up having an impact, history shows that it can take decades for it to show up in the productivity numbers. The question is whether this time is different, particularly given the speed and scale at which developments in AI are occurring.

Will AI replace humans in their jobs? While technology advances, the human element remains essential in decision-making across many professions. To use ChatGPT effectively, the user needs field-specific knowledge, and developing this deep expertise requires experience of doing the task ourselves. However, these models are becoming increasingly advanced, and in professions where human interaction is less important, the impact of AI may be greater.

Can model capabilities continue to advance at this speed? Large Language Models (LLMs) could potentially utilise self-play techniques, similar to AlphaGo and AlphaFold, to learn faster than the essentially human-driven reinforcement learning step currently allows. However, this is challenging due to the computational intensity of modelling the real world. LLMs also have the potential to learn by interacting with the real world through conversation, actively seeking data to enhance their understanding. Finally, despite being trained on historical data, there is evidence of creativity in LLMs.

How might the ecosystem of models evolve over the coming years? When adopting technology for investing, people tend to prefer open-source, transparent technology over a company's service where we cannot see how it processes our input to produce the output. From a practical perspective, however, there is unlikely to be a highly capable language model that is both open-source and trainable from scratch, because of hardware and infrastructure requirements that the open-source community cannot meet.

Another important consideration is the amount of energy a query to an AI model consumes; there is currently a strong line of research into making LLMs more efficient to run, as well as into developing smaller, more specialised models.

Part II: Limitations and biases

How do we deal with biases? There is some evidence that LLMs exhibit human psychological biases such as over-extrapolating from the past and overreacting to news. Further, these models can give plausible, well-argued, intelligent-sounding answers that are often wrong in some respect, a phenomenon known as hallucination. In the machine learning models of the past, bias meant getting things wrong consistently. Today, different biases can actually be encoded via different prompts, and in some cases this variation in outputs could potentially provide useful information on how uncertain the model is about a given response.

How do LLMs learn compared to humans? A big difference between LLMs and humans is that humans have an ability to learn from relatively little data. LLMs, in contrast, learn from enormous amounts of data. If AI were to evolve to be more like humans, it would need to have an ability to generalise more with less data. We are seeing early evidence of this in the training of ChatGPT, where the reinforcement learning from human feedback stage uses a substantially smaller dataset compared to the primary training data.

What are the risks from bad actors’ use of Generative AI technology? The risks are substantial, particularly in relation to the ability of AI tools to pass the Turing Test, where we cannot tell that we are not talking to a human. The implication of this for the political system and our culture is vast. A prerequisite for this risk to be managed in the future is to have knowledge on whether we are interacting with an AI tool or a person. We need technology to prove identity. Another key risk is that people start to distrust everything. This could result in a world where we no longer think anything on the Internet is true, and consequently revert to more traditional sources such as books for information. There may also be greater incentives for watermarking content as coming from a human, essentially proving identity.

Part III: Impact on the broader economy

What will be the ultimate use-case for Generative AI? A popular use-case at present is retrieval-augmented generation, essentially document search, which allows users to incorporate information that was not originally part of the training data into the model. We certainly see this concept in the discretionary investment management space, where LLMs can help human portfolio managers (PMs) process information more quickly and efficiently. With that said, these use-cases are not individually revolutionary. Arguably, it is less about an ultimate use-case for Generative AI and more about the cumulative effect of one’s activities.

A lot of thought and energy is currently being put into prompt engineering. It is effectively a very challenging optimisation problem, where a one-word change in the prompt can produce large, discrete changes in the answer. If prompt tuning is done right, however, it can even beat fine-tuned models.

Part IV: Applications in asset management

What is the impact of Generative AI on quantitative asset management specifically? Generative AI can be applied across a number of areas of quantitative asset management, but each comes with associated obstacles or challenges. Firstly, Generative AI could be a means of reducing costs. This can be in the form of not needing as many analysts or developers to achieve the same research output due to productivity gains, or of employing Generative AI to unlock strategies that were too costly to pursue in the past, potentially leading to more alpha. The second is as a tool for model selection: testing ideas via forward-looking tests based on synthetic possible future scenarios, where all models are subjected to the same set of hurdles. The third use-case is as a predictor: in the context of sentiment analysis, where the LLM learns to associate a piece of text with a sentiment that can then be used to predict returns, the latest LLMs seem quite good at evaluating complex and nuanced text. However, the key challenge with using LLMs as a predictor is the inability to reliably backtest them without forward-looking information, as the LLM is trained on the full sample of data.

Can LLMs come up with research ideas? Coming up with good ideas for research can be difficult. However, an LLM, trained on all financial literature, might well be able to produce speculative ideas, which can then be automated and tested. The concept of the “meta-researcher” may emerge, whereby large groups of digital researchers are marshalled to increase productivity. However, at scale, the high computational costs of this are likely to be a challenge.

If LLMs are being used to make investment decisions, are those decisions going to be interpretable? Even if an LLM is successfully making some predictions, it is difficult to really understand why it is making those predictions. We can ask a model such as ChatGPT to justify its reasoning, and it will be able to generate a justification for why it thinks a given prediction may be true, based on its training data. However, whether this justification is actually credible is a different story. We are, however, able to extract the full high-dimensional vectors that make up the inner workings of these LLMs. These are potentially useful pieces of information that can help us begin to understand how they arrive at their answers.

In summary, how should we think about this technology? It is an amazing time to be living through this innovation. As with any technological disruption, there will be risks, and we need to identify these risks and try to manage them. The economic impact on the broader economy is happening a lot faster than previous technological disruptions, and that makes this time different. Eventually, we should seize the AI opportunity – the risk is great, but so are the opportunities. It would be a big mistake for any company to ignore this space.

Part I:

Generative AI: past, present, future

How did we get to today’s state of Generative AI? What is it all about?

Neil Lawrence (NL): AI is often misinterpreted as machines that solve problems from first principles. On the contrary, the AI of today follows an empirical, data-driven approach: can we look at data in the real world, and reconstruct it in some “intelligent” way? This concept has been employed on a small scale for many years, including in finance, but its full potential only started to be realised once models started being trained on everything that everyone has ever produced, at scale.

Just a decade ago, computer scientists actually doubted that this data-driven approach could solve computer vision. The breakthrough came with the ImageNet dataset of 14 million images, which showed that it was possible¹. Language models have also seen constant advancements, with significant boosts coming first from the RNN (recurrent neural network) and then the Transformer, which enabled us to scale up massively using parallelisation on GPUs (graphics processing units). At these huge data scales, through the process of compressing that data, these language models are able to “reconstruct” human reasoning.

This progress was fuelled by ambitious companies backed by substantial VC funding, allowing them to experiment with these concepts. Many of these ventures didn't initially need to go to market (the first LLMs, in fact, came out of Google), which further spurred innovation.

Why has ChatGPT all of a sudden become popular now?

Nicholas Barberis (NB): ChatGPT became very popular initially for reasons that seem orthogonal to its actual usefulness, as people were impressed by the stunts that it could perform (such as writing a rap about Milton Friedman’s economics). These stunts were largely useless, but amazing to humans, who would struggle to do them. In terms of more practical uses, people were impressed by its emergent abilities. One such ability is coding, which anecdotally in academia has become very helpful. Of course, the very user-friendly interface was another important factor in its widespread adoption.

Is this an evolutionary technology, or a revolutionary technology?

NB: There is a view among some economists that no single technology, on its own, has ever been revolutionary – real economic impact always requires a combination of factors. Robert Fogel’s Nobel Prize winning work showed that railroads, which many people thought were revolutionary for GDP growth in the US in the 19th century, were actually not so revolutionary, and offered only a modest improvement on alternative transportation methods like canals. Even if the technology does end up having an impact, history shows that it can take decades for it to show up in the productivity numbers, and often in the first decade there is in fact a slowdown as people think about the technology and try to fit it into their existing processes. Perhaps Generative AI will be the first truly revolutionary technology, but history suggests that other factors must also come into play, and that any impact will take a long time to materialise.

Greg Bond (GB): From a business perspective, our general approach to technology is a steady trial and error adoption process as we figure out ways of incorporating it into our investment process. We have had a lot of experience with other technologies that were initially very exciting, and then disappointed a little bit until we eventually figured out how to use them effectively. Even the Internet was ultimately just an evolution in that it empowered us to be better at gathering information, complementing what we do. The question is whether this time is different, whether we need to make a huge change in our business and do things completely differently.

Matthew Sargaison (MS): Even if it is evolutionary, developments within AI seem to be happening at a much faster pace. Typically in quantitative finance research, we avoid fitting more than a handful of parameters in a given model to reduce overfitting. With these language models, however, we are starting with hundreds of millions of parameters, and in the next generation, perhaps a year later, we go straight to hundreds of billions of parameters, skipping a few orders of magnitude. I’m not sure people foresaw how quickly this kind of change would happen, or the kind of emergent abilities that we suddenly began to see at larger scales. We went from models producing unconvincing, almost joke-like content to models producing impressively useful and relevant text.

Campbell Harvey (CH): I really believe it is different this time, in that the period of transition we have typically seen historically is happening a lot faster. Before, when technology took decades to transition, it was possible for people to retrain, and people could adapt over generations. In today’s case, this is hitting us so quickly that it is creating a different kind of risk to what we have seen in past technological revolutions.

Will AI replace humans at our jobs?

NL: While technology advances, the human element remains essential in decision-making across most professions. Computers, although powerful tools, cannot replace us due to their lack of human vulnerabilities and emotions. Crucial to our intelligence are our relationships and emotions, something computers again cannot replicate. The relationships between humans form the backbone of our institutions and professions, and underscore the importance of human-led decision-making. So, while computers can be tools to aid us, they cannot supersede human judgment and understanding. Ultimately, a computer can never be more human than a human.

NB: However, there is still a vast range of professions where it is not important to have that kind of emotional experience between humans.

NL: In those cases, the impact of AI may be greater. There may also be a situation where people start to care more about the human element of decision-making across these jobs, making that a more critical element of what people expect to receive. An interesting consideration is the software engineer, as most of their work is simply translating someone’s will into something that is operating on a computer.

Gary Collier (GC): In software engineering, historically what we have seen over the past 10-20 years is that, when something like an open-source framework comes out, it has actually elevated people to do new things as a lot of heavy lifting has already been done. So I think it is premature to say that lots of software jobs are going to disappear – I think they will probably increase over the next few years. This is because we will need engineers to overcome existing limitations, particularly with respect to interconnectivity between different, often outdated, platforms. While we could have a powerful, integrated AI like ChatGPT extend our code, it must interact with numerous systems that are challenging for modern technologies to communicate with. If every platform was fully API-driven with clear APIs and docstrings, allowing machines to immediately understand and utilise them, this scenario might be possible. However, we're a long way from this level of standardisation.

NL: The challenge overall with ChatGPT is that we need the deep expertise. To design good software, or to create a good legal summary, the user needs field-specific knowledge to use ChatGPT effectively and identify any errors in its responses. Developing this expertise requires experience – we need to do the task ourselves, make mistakes, and then learn from them in order to build up the ability to evaluate ChatGPT’s answers.

Dan Nadler (DN): Even with today’s ChatGPT though, for instance when designing a software system, it is already able to give you a comprehensive answer that contains good field-specific knowledge. It can even be better than an engineer, as it has read so much of the documented information out there, far more than the typical engineer. However, the quality is very dependent on the prompt. If you ask it without any context, it may not give a good answer, but if you provide more details and examples, it can generally produce well-suited responses. In today’s state of the world, admittedly you may need some expertise to design this prompt. But it may be reasonable to expect that these models will get more and more advanced over time.

NL: As we increasingly rely on brilliant AI systems, there's a risk that we may lose our ability to critique their outputs, similar to how dependence on GPS has eroded our navigation skills. A concern then arises because the AI's knowledge is based on historical data, limiting its understanding to situations we've already encountered.

A relevant analogy is the use of fly-by-wire systems in modern aircraft where human instructions are processed by computers to control the flying of the plane. Before we used these systems, we first had to characterise the interface between human and machine: we had to quantify how the aircraft flew, what the pilot’s inputs were, how the aircraft responded, and so on. However, with AI like ChatGPT, we haven't yet fully characterised human language and how we use words to share information. The question then becomes, if we start using AI with such vast knowledge, are we still in control, or merely under the illusion of control?

The data used to train LLMs is text created by humans. Going forward, a lot of text may be created by Generative AI itself, and we may not be able to distinguish between human- and AI-created content. Is there going to be a hard limit on how much data there is out there, and hence a natural plateauing of the technology?

Stefan Zohren (SZ): One very important aspect is the reinforcement learning step in training these LLMs. Right now this step is essentially human-driven, which limits how fast the models can improve. However, if you think outside the space of LLMs, you have models like AlphaGo and AlphaFold, whose capabilities were developed through self-play. It is much quicker for a model to compete with itself, and so progress becomes much quicker as it is not limited to playing against a human and learning at a human speed. In this setting, for example with coding, you can in principle get the model to write its own code and compete against itself in the code world, leaving the human out. This could also apply in other areas: in biology, the model could generate specific examples that could be physically simulated and tested, speeding up the loop as you do not need a human in the middle to write more text.

MS: If we consider the existing corpus of all text written by humanity, that may be sufficient for learning good language translation for most languages. But will it be enough to write a sequel to Finnegans Wake, something more at the extremes of language usage? That often requires the models to be creative and do something new. If the models are just trained on what already exists, they will not be able to do that – they will just write something that sounds very much like what has been written before.

NL: A key limitation here is that simulations, like those used by Formula 1 teams for wind tunnel or fluid dynamics testing, always require some form of abstraction relevant to the question being addressed. These abstractions are designed based on human judgment and understanding, and are needed because completely modelling reality is computationally impossible. That's why empirical methods, which involve direct interaction with the real world, are effective: they let the world do the necessary computations to produce the result.

For LLMs, although they currently have limited motor-sensing capabilities, they can interact with the real world through conversation, generating new data. They can actively seek this data by asking us questions, to enhance their understanding of the human mind. The active data gathering from real-world human interactions could be a compelling development.

CH: There is also some evidence that these models can be creative, despite only being trained on historical data. A recent study from Wharton² set the task of coming up with a new physical product for the college student market that would retail for less than $50, and gave this to GPT-4 as well as business students. In the top 40 ideas (the top decile), 35 of them (87.5%) were generated by GPT-4: most of the best ideas came from the LLM.

NL: The AlphaGo project also showcased AI's potential for creativity. During the second match against Lee Sedol, AlphaGo performed an unexpected move, astonishing human Go experts. This occurred because we programmed the machine with the game's rules – the 'physics' of the Go world – and let it explore novel strategies. However, this is more difficult to achieve in real-world applications, as accurately simulating the world based on physics and other models is computationally too intensive.

How do you see the ecosystem of models evolving over the coming years?

GB: I think people may prefer open-source, transparent technology, compared to something where we send an input to a company that we cannot see into and get back an output. As an adopter of the technology, especially in the case of asset management, I would want to know what’s driving it: if the model is picking a stock to buy, I would want to understand the factors that lead to that decision. For the most part, our existing code base is built on open source, tailored with a competitive layer that translates the code into an alpha model. There are not many things where we just take it as a black box. It ultimately depends on people’s willingness to be happy using black boxes.

Another relevant aspect from the financial markets perspective is the data that these models are trained on. The technology is not particularly useful if the data that it’s trained on is outdated, or if the datasets are irrelevant. For the reduced-form problem of trying to predict financial markets, we may not even need huge amounts of compute in order to digest specific local data. In this case I see smaller, more specific models potentially being more useful, and a key factor being who owns and controls the data.

DN: From a practical perspective, in the current state of the world there is very unlikely to be a highly functioning language model that is actually open source and trainable from first principles all the way to being production-ready. This is because of hardware and infrastructure requirements that the open-source community cannot meet. In the case of the LLaMA models from Meta, they are indeed open source, but Meta is the company that built them, and we are still dependent on Meta to produce newer and better versions.

NL: One question is the open versus closed debate, which I think is likely to vary: the case of Linux shows how open-source code can become integral to modern operating systems. But the key factor here is compute, where giants like OpenAI/Microsoft, who anticipated this and heavily invested, might maintain a lead. The worldwide under-supply of GPUs, evident from Nvidia’s soaring stock, also plays a part. Still, this scenario could be influenced by tech advancements like moving beyond GPUs.

Another aspect is the size of the models, whether it’s one universally applicable model or multiple specialised ones. Cloud technology could play a crucial role with smaller models. A few years back, a review from Google showed data centre electricity consumption for search was about 1% globally, almost comparable to Africa's total consumption. Today, a query to these AI models consumes 100 times the energy of a search query. If these models end up absorbing say 10% of the world’s electricity supply, it's not only impractical, but it is also going to be a major constraint on who has access to these models.

CH: In the future, we may see a case where a vast amount of GPU power is not on the cloud, and rather in computers owned by the public (where only a small fraction of capacity is utilised today). If this is the case, decentralised computing can provide another venue for model developers to train their models.

DN: There is currently also a strong line of research into making LLMs more efficient to run, such as compression techniques.

NL: There are also strong incentives for semiconductor manufacturers to innovate hardware for more efficient LLM operation, such as integer-based computation rather than the half-precision floating-point numbers currently used on GPUs. It is intriguing that, based on current knowledge, the billions of parameters in these LLMs need to exist over a large range (hence why they require floating point), but the precision with which they are represented can be extremely coarse. This to me highlights significant gaps in our comprehension of efficient model operation. Given the incentives for companies outside of Nvidia to innovate in this space, we may see substantial advancements in hardware for even greater computational efficiency.
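
As a rough intuition for the point about range versus precision, below is a minimal sketch of absmax int8 quantisation: each weight tensor is stored as coarse 8-bit integers plus a single floating-point scale factor. The scheme, sizes and numbers are illustrative only and do not correspond to any particular vendor's hardware or any production quantisation method.

```python
import numpy as np

# Toy illustration: a float16 weight vector is stored as 8-bit integers plus one
# scale factor, preserving a large dynamic range while keeping per-weight
# precision very coarse.
def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0          # map the largest weight to +/-127
    q = np.round(weights / scale).astype(np.int8)  # coarse integer representation
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float16)       # placeholder "weights"
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print("max abs reconstruction error:", np.abs(w.astype(np.float32) - w_hat).max())
```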

Part II:

Limitations and biases

In human text data, naturally there are biases, and LLMs tend to pick up these biases. How do we deal with this issue?

NB: We do see some evidence that LLMs exhibit human psychological biases. One example is a paper by Horton³ that shows that LLMs exhibit well-known biases such as status quo bias. Another is a Yale paper⁴ that gives an LLM news articles and asks it how they change its view on future expected economic outcomes. The author found that the LLM makes similar mistakes to humans, such as over-extrapolating from the past and overreacting to news.

NL: These models can give plausible, well-argued, intelligent-sounding answers, but are often wrong in some respect. One thing that is quite fundamental about some of these biases is that they are tricks we use to deal with the fact that we are not able to solve the full problem; they are predispositions to help us try to solve things in a certain way.

The interesting thing is that somehow all these biases and viewpoints can be seen within a single model: if you prompt it in a given way, it can give you an answer that is exhibiting a given bias. This is very different to bias in the machine learning models of the past, where bias meant getting things wrong consistently. In this case, the variation in the output due to biases encoded by different prompts can in fact give us fundamental information on what can or cannot be determined, and this can actually be useful.

GB: We need an objective truth to say that something is biased. If a model is picking stocks, it can make mistakes (i.e., lose money) and have some bias if it has some behavioural linkage for example. However, in cases where there is no objective truth, defining bias is unclear, and that is going to be the challenge.

SZ: Bias issues in LLMs echo similar issues in other machine learning models. Consider a simple price prediction model where we try to classify whether a market will go up or down in the future. Standard neural networks are well-known to be overconfident and might predict with 99% certainty that the price will rise, when the actual likelihood is merely 50.005%. Relying on these predictions for position scaling could quickly lead to disaster.

Despite this overconfidence, we have remedies. Bayesian techniques, for instance, help quantify the model's uncertainties, making it aware of its own unreliability. This is useful for instance in image classification, where a slight alteration, like changing a pixel, can influence the output. Here, Bayesian methods mitigate the model's sensitivity to these kinds of adversarial attacks, thus improving its understanding of uncertainty. Another technique involves dropout sampling in neural networks. By observing the variance in outputs, we can discern the model's accuracy.

These methods, used in simpler applications, could be adapted for LLMs, exploiting the variance in responses to assess uncertainty. However, the complexity of LLMs makes this a challenging task.
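
As an illustration of the dropout-sampling idea mentioned above, here is a minimal sketch for a toy up/down classifier; the network architecture, features and sample counts are placeholders rather than a production model.

```python
import torch
import torch.nn as nn

# Minimal sketch of dropout sampling ("MC dropout") for a toy up/down classifier.
# Keeping dropout active at prediction time and repeating the forward pass gives a
# spread of outputs whose standard deviation is a rough proxy for model uncertainty.
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

def mc_dropout_predict(model, x, n_samples=100):
    model.train()  # leave dropout switched on at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)  # point estimate and uncertainty

x = torch.randn(1, 10)  # hypothetical feature vector for one asset
p_up, p_std = mc_dropout_predict(model, x)
print(f"P(up) = {p_up.item():.3f} +/- {p_std.item():.3f}")
```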

How do LLMs and humans differ in their method of learning?

NB: A big difference between LLMs and humans is that humans have this amazing ability to learn from relatively little data. Children can hear a few words being spoken, and before you know it, they learn how to speak too! LLMs are totally different – they learn from enormous amounts of data. I think the reason humans can do this is because they come essentially pre-packaged with some priors that allow them to generalise incredibly effectively about the world (although these priors can sometimes also lead to cognitive biases). If AI were to evolve to be a bit more like humans, then maybe they would need to have an ability to generalise more with less data.

DN: You already see that a little bit in the training of ChatGPT itself, as the reinforcement learning from human feedback stage already uses a substantially smaller dataset compared to the primary training data. There is also the concept of few-shot prompting (also known as in-context learning), where the model can be given an example in the prompt and will be able to learn from that example.
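
A minimal sketch of what such a few-shot prompt might look like, with made-up headlines and labels; the assembled string would simply be sent to whichever LLM is in use.

```python
# Few-shot (in-context) prompting: the examples live in the prompt itself, so no
# model parameters are updated. Headlines and labels below are invented.
examples = [
    ("The company beat earnings expectations and raised guidance.", "positive"),
    ("Regulators opened an investigation into the firm's accounting.", "negative"),
]
query = "The CEO resigned unexpectedly ahead of the product launch."

prompt = "Classify the sentiment of each headline as positive, negative or neutral.\n\n"
for text, label in examples:
    prompt += f"Headline: {text}\nSentiment: {label}\n\n"
prompt += f"Headline: {query}\nSentiment:"

print(prompt)  # this string would be sent to whichever LLM is being used
```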

NL: The way humans typically learn is by acting on the world, and seeing how the outside world reacts to us – these models primarily do not learn in this way at all. However, we could soon potentially see models starting to learn in this human way, as they are now interacting with humans via conversation on a massive scale, and are able to learn very quickly in the space of a conversation. Learning of this kind could result in a new phase of what these models can do.

Bad actors can also use Generative AI technology. How afraid are we of this happening?

CH: The risks are substantial, particularly in relation to the ability of AI tools to pass the Turing Test, where we cannot tell that we are not talking to a human. The implication of this for the political system and our culture is vast – one could imagine an existential risk in a scenario where millions of these bots launched by hostile entities are able to interfere and change public opinion during events like elections. We already saw this in the last US elections at a very crude level before the advent of technologies like ChatGPT that can produce much more convincing and well-written text.

A prerequisite for this risk to be managed in the future is to know whether you are interacting with an AI tool or a person. We need technology to prove identity. We do not have this technology right now, but there is a robust and growing research area on decentralised identity as a solution.

The increased capabilities of AI voice replication could also result in more cases where you may get a scam phone call from someone whose voice sounds like someone you know. This kind of “impersonation”, whether based on audio or even simply based on writing style in emails, could be another danger.

GB: Despite these risks, there is an element of society adapting to these risks as well. People are already very sceptical when they get a random phone call. But if the quality of scams increases, people may start to distrust everything, and this may have negative consequences on us dealing with communications that are actually real and legitimate.

NL: That is a key risk. A lot of us still have this “2006” picture of the Internet, where it is a land of information with the odd lake of misinformation. The reality is that the modern Internet is a sea of misinformation with the odd island of information, and bad actors using LLMs for disinformation will accelerate our appreciation of that. It may result in a world where we no longer think anything on the Internet is true, and revert to more traditional sources like books for information. We may also require more socio-technical ecosystems like Wikipedia to emerge, where a combination of modern technology, human involvement, and the right incentives can create a truthful source of information. There may also be greater incentives for watermarking content as coming from a human.

Part III:

Impact on the broader economy

Where is the killer application? What is going to be the big impact that Generative AI will have?

DN: The most popular use-case that everyone is working on is retrieval-augmented generation. Every single start-up that we talk to is doing some variant of this. This essentially is a kind of document search, which allows you to incorporate information that was not originally part of the training data (such as local documents) into the model.
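
A minimal sketch of the retrieval-augmented generation pattern is below. For brevity it uses a TF-IDF retriever from scikit-learn in place of the dense embeddings and vector stores typically used in practice, and the documents and question are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Minimal retrieval-augmented generation sketch: retrieve the most relevant local
# document, then stuff it into the prompt as context for the LLM.
documents = [
    "Internal research note: momentum signals weakened in European equities in Q2.",
    "Risk memo: counterparty exposure limits were revised in March.",
    "Broker summary: semiconductor supply constraints are expected to ease.",
]
question = "What happened to momentum signals in Europe?"

vectorizer = TfidfVectorizer().fit(documents + [question])
doc_vecs = vectorizer.transform(documents)
q_vec = vectorizer.transform([question])
best = cosine_similarity(q_vec, doc_vecs).argmax()  # index of most relevant document

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context: {documents[best]}\n\nQuestion: {question}\nAnswer:"
)
print(prompt)  # the assembled prompt is then passed to the LLM
```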

Jeremy Andre (JA): We certainly see this concept in the discretionary [investment management] space, where LLMs can help human PMs process a lot of information faster and more efficiently. A discretionary analyst needs to spend a significant amount of time every day reading a deluge of text information, from broker research, to news, and regulatory documents. LLMs are able to process all this text, and extract the key points relevant to what the analyst specifically covers or is interested in, and summarise in an email – saving a huge proportion of the analyst’s time. Today, this is the first thing that discretionary analysts are asking for, as a basic productivity enhancement tool. Even for the ones who do not yet trust the ability of LLMs to summarise, they can ask it to find the relevant paragraphs and extract them verbatim.

This text processing can apply to the quant side as well: I have experimented with using an LLM to read a 60-page PDF of a financial academic paper, and in 30 seconds I could ask it what datasets and components I needed to reproduce the results of the paper. This summarising ability was quite good and allowed me to read the paper much faster.

Another key aspect is that LLMs help simplify the interface between human and machine. Many teams on the discretionary side now use it to assist in their coding, such as using pandas for data analysis. When they are unsure on how to write a piece of code, or run into an error, they are now able to be easily assisted by LLMs, whereas previously they had to spend much longer on Stack Overflow and Google to try to find the solution themselves.

There are also more speculative applications that we are considering. For instance, LLMs could also potentially be used in alpha generation, extracting patterns in text data (such as news, regulatory filings, broker research, earnings calls) that predict periods of strong out/underperformance. This pattern-finding can help PMs spot new things. Another possible use case would be the use of GANs (Generative Adversarial Networks) to generate synthetic alternative price histories, which could be useful for tail-risk estimation and risk management.

GC: Usually when it comes to pushing the boundaries of the firm’s technology capabilities, we see the quantitative units leading the way. What is unusual about this wave of Generative AI applications is that it is the discretionary and legal teams who are instead being very active in their deployment of this technology.

CH: Other applications could also include comparisons of documents. You can provide the LLM with 10-K filings across different years, and ask it to highlight the differences and the risks that it can see. There are also many other creative use cases being explored currently, including a recent paper⁵ that looks at earnings conference call Q&A sessions and gets ChatGPT to predict the answers to the questions given the earlier context, comparing these to the actual answers given by the company executives as a measure of the amount of new or surprising information provided. Earnings conference calls are particularly interesting as they are not totally scripted. In a similar manner, ChatGPT could potentially be a useful tool in trying to analyse earnings call responses to determine the probability of the speaker telling the truth.

GB: So far it is less about a “killer app” and more just about “apps”. Reading a document faster does not yet give us 3-Sharpe strategies or 10% annualised GDP growth – these are not life-changing events. It is like having a map app on your phone – it is useful, but it is not exciting me quite yet. I have not experienced a killer app in the investment management world, as to me it is more about the cumulative set of activities that you do, just as we have been incorporating innovations in technology for many years already. It is likely to be one of 100 apps that we use, not “the” app.

GC: Yes, the killer app does not appear overnight – it evolves and gets there over several years. It depends on the openness of data across the organisation, the openness of APIs, and the mindset of different teams on how willing they are to start using their data with this technology. To me the killer app is the integration point that brings together wide sources of knowledge into one query-able and actionable type of interface.

DN: On the point of what other startups and third parties are offering, while they are all offering different kinds of retrieval-augmented generation, they are mostly differentiating themselves on their user interface (UI), or on other regulatory, legal, and compliance factors. I do not think that many of them have a convincing value proposition, as we are already doing a lot of it in-house. Some of the most compelling third-party examples out there are Microsoft’s Office 365 and Google’s Bard, which connect an LLM to things like your email and calendar, and so you can ask questions about your own internal/personal documents and get it to write emails, etc. At its core, they still largely revolve around this idea of retrieval-augmented generation.

GC: I would add that Office 365 has got quite a lot of promise, based on the demos that I have recently seen, but it is not there yet. It is not particularly impressive if you look at it right now, but the technology is still very new.

JA: While these third-party solutions may be less relevant for a large technology-driven company like Man Group, they may be very useful for smaller, traditionally discretionary places with less quantitative capability. For instance, you can imagine that places like private credit funds, who have to crunch a lot of long documents that tend to be structurally similar but subtly different, could benefit from being able to process this information very quickly.

DN: Another way for start-ups to differentiate themselves is through their use of prompts. There is a start-up called Pathway, whose value proposition is around specific prompts they use. Their product can take a prompt, and solve that problem for you in one portal, by finding documents, writing code, pulling in data from other places, showing charts, and so on.

NL: Prompt engineering is indeed a key point, as it is a big and difficult problem. It is effectively an optimisation problem if you know the kind of answers you are looking for, but the nature of the optimisation problem is extraordinary: a one-word change in the prompt can produce large, discrete changes in the answer. A lot of thought and energy is being put into this by creative people, because if prompt tuning is done right, it can even beat fine-tuned models.
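
To make the "prompt engineering as optimisation" framing concrete, here is a toy sketch that scores a handful of candidate prompt templates on a small labelled validation set and keeps the best one. The `ask_llm` wrapper, the templates and the labels are all hypothetical placeholders.

```python
# Toy sketch of prompt selection as a discrete search problem.
def ask_llm(prompt: str) -> str:
    # Placeholder: in practice this would call whichever LLM/API is being used.
    return "positive"

candidates = [
    "Classify the sentiment of this headline as positive or negative: {text}",
    "You are an equity analyst. Is this news positive or negative for the stock? {text}",
    "Does the following headline suggest the share price will rise or fall? {text}",
]
validation = [
    ("Profits doubled year on year.", "positive"),
    ("The firm issued a profit warning.", "negative"),
]

def score(template: str) -> float:
    # Fraction of validation examples the template gets right.
    hits = sum(ask_llm(template.format(text=t)).strip().lower().startswith(y)
               for t, y in validation)
    return hits / len(validation)

best_template = max(candidates, key=score)  # even one changed word can move this score a lot
print(best_template)
```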

MS: Recently a sell-side bank did a talk and mentioned that they now have “prompt engineer” as a job title. However, this wasn’t in the context of alpha generation or even investment management; this was in their customer support team, and their role was to design prompts to help solve customer queries quickly.

NL: I have one more thought on a potential “killer app”. Imagine a hypothetical world where you are given computers and LLMs, but not programming languages. The nature of the programming languages that we would probably choose to design in that situation is not going to look like any of the programming languages we use today. This is because today’s languages all involve some kind of compromise between how a human reads things and how a computer reads things, trading-off compute speed versus convenience of use. Now suppose we combined LLMs with program synthesis (a digital computer with a human interface): this would allow a human to express an idea, and the computer could then show it in different languages and deploy/compile it in the most efficient way. This would be an extraordinary idea, but it feels like it would take some time to get there.

Part IV:

Applications in asset management

What is the impact of Generative AI on quantitative asset management specifically?

SZ: One key issue for quant models is that, if I use an LLM such as ChatGPT in a backtest going back many years in history, there is going to be a future lookahead bias. For example, it will associate strong negative sentiment with the words “pandemic” and “lockdown” given the events around Covid in 2020, but it would not have known this in 2019. Since trading models should be point-in-time and hence cannot have future information, it is impossible to reliably simulate backtested performance of an LLM like ChatGPT.

If we were to use an LLM, we would need to build it from scratch and train it point-in-time, for example every year. This is indeed what we are looking into at the Oxford-Man Institute, where we can take Wikipedia (where the core understanding of many of these models comes from) and build it up over time by reconstructing it from its revision history, and retrain the models on these point-in-time snapshots. However, the issue is that this retraining currently limits us to only using simpler models like RoBERTa (a variation of the BERT model) and GPT-2.
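
A minimal sketch of the point-in-time idea, assuming each training document carries the date it became available; the documents and dates are invented, and the actual model training is left as a placeholder.

```python
import pandas as pd

# Point-in-time training corpora: a model retrained "as of" a given date only ever
# sees text published before that date, avoiding lookahead bias.
corpus = pd.DataFrame({
    "published": pd.to_datetime(["2018-05-01", "2019-11-20", "2020-03-15", "2021-07-02"]),
    "text": ["doc A", "doc B", "pandemic lockdown commentary", "doc D"],
})

for year_end in pd.to_datetime(["2019-12-31", "2020-12-31", "2021-12-31"]):
    snapshot = corpus[corpus["published"] <= year_end]
    # train_model(snapshot["text"]) would go here, producing one model per snapshot
    print(year_end.date(), "->", len(snapshot), "documents available")
```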

It is also an interesting question on the machine learning side whether it is possible to edit the model and “block out” future information after a given date, but there are not many good techniques for this currently.

NB: Even if we do not have the comfort of being able to backtest it, there is potentially still benefit to using it as it can still perform well compared to alternatives. In the context of sentiment analysis, where the LLM learns to associate a piece of text with a sentiment which can then be used to predict returns, the latest LLMs seem quite good at evaluating complex and nuanced pieces of text for this purpose.
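
As a toy illustration of how such sentiment scores might feed a strategy, the sketch below turns per-stock scores (assumed to come from an LLM reading recent news) into a demeaned, unit-gross long/short signal; the tickers and numbers are made up.

```python
import pandas as pd

# Toy sketch of turning LLM sentiment scores into a cross-sectional long/short signal.
sentiment = pd.Series({"AAA": 0.8, "BBB": -0.4, "CCC": 0.1, "DDD": -0.9})

signal = sentiment - sentiment.mean()   # demean so the book is roughly market-neutral
weights = signal / signal.abs().sum()   # scale to unit gross exposure
print(weights.round(3))
```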

MS: It could also be possible to test the model using only data after its training cutoff date (September 2021 in the current case of ChatGPT). For instance, you could ask it for its investment thesis if Russia invaded Ukraine, asking it for its trades across a range of asset classes, such as commodities and interest rates. This would be a reliable kind of backtest (albeit with a short history) since the data is out-of-sample.

Given that some models may be updated live, it also raises the question of whether we should be actively collecting data from LLMs today, building up a repository of point-in-time responses on the kind of trades they would like to make, or what they think of our trades. This could prove to be useful data for backtesting in the future.
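
A minimal sketch of what such a point-in-time repository could look like: each prompt and response is stored with a timestamp, so that only responses recorded before a given backtest date are admissible later. The `ask_llm` wrapper and the file format are assumptions.

```python
import datetime
import json

# Minimal sketch of archiving LLM responses point-in-time for use in future backtests.
def ask_llm(prompt: str) -> str:
    return "example response"  # placeholder for the real model call

prompt = "Which G10 rates trades look attractive this week, and why?"
record = {
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "prompt": prompt,
    "response": ask_llm(prompt),
}
with open("llm_responses.jsonl", "a") as f:  # append-only point-in-time log
    f.write(json.dumps(record) + "\n")
```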

NL: There perhaps could also be some kind of prompt that could give you approximately such a model that does not use knowledge past a certain date. While this will not be exact (since the model would require labelling of its training data to really achieve this), and it will be difficult to guarantee the absence of forward-looking information given the complexity of the model, it may be a reasonable approximation that is much lower cost than having to retrain the model every year.

GB: We already use sentiment analysis, and we think it is useful, but the key question is how much better this new technique is compared to the models that we have already. Especially given LLMs require a much larger investment, we need an answer to that before we can deploy it as a model.

MS: And to measure how good the LLM is, we need a good backtest to validate it.

NL: Another issue with using these models in trading strategies is that all the closed-source models like ChatGPT are hidden behind an API. This means that, as a user, you do not know what the provider may be doing with the model. For instance, GPT-4 apparently has been exhibiting a degradation in quality due to OpenAI wanting to use less compute. You are totally subject to whatever business decisions the provider is making, and this introduces operational risk.

CH: I see Generative AI being applicable in three categories. The first is as a tool that reduces costs. The easiest way to produce alpha is to reduce cost, and this can be in the form of not needing as many analysts/developers to achieve the same research output due to productivity gains. Further, new strategies that were not profitable before due to very high costs may become feasible, again increasing alpha.

The second is a tool for model selection. In quantitative finance we test ideas via a backtest, which is a one-shot test, and often there are structural changes in history that can make the backtest of questionable value. Through time-series generative models like GANs, we may be able to generate many synthetic possible future scenarios, which allow us to create a forward-looking test that can help us calibrate and select the best model. That is, the models are all subject to the same hurdles in each future scenario. In contrast to the one-shot backtest that delivers a single Sharpe ratio, forward testing allows for a distribution of Sharpe ratios – as well as of other metrics.
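
A minimal sketch of the forward-looking test idea is below. For simplicity the synthetic scenarios are block-bootstrap resamples of a placeholder return history, standing in for a trained generative model such as a GAN, and the candidate "models" are deliberately trivial.

```python
import numpy as np

# Forward-looking test sketch: each candidate model is evaluated on many synthetic
# future scenarios, giving a distribution of Sharpe ratios rather than a single number.
rng = np.random.default_rng(0)
history = rng.normal(0.0004, 0.01, size=2500)  # placeholder daily return history

def synthetic_scenarios(returns, n_scenarios=500, horizon=250, block=20):
    # Block bootstrap stands in for a trained generative model here.
    scenarios = []
    for _ in range(n_scenarios):
        blocks = [returns[i:i + block]
                  for i in rng.integers(0, len(returns) - block, size=horizon // block)]
        scenarios.append(np.concatenate(blocks))
    return np.array(scenarios)

def sharpe(r):
    return np.sqrt(252) * r.mean() / r.std()

scenarios = synthetic_scenarios(history)
# Each candidate maps a scenario to strategy returns; both examples are trivial stand-ins.
models = {"long_only": lambda r: r,
          "contrarian": lambda r: -np.sign(np.roll(r, 1)) * r}
for name, strat in models.items():
    sharpes = np.array([sharpe(strat(s)) for s in scenarios])
    print(name, f"median Sharpe {np.median(sharpes):.2f}, "
                f"5th pct {np.percentile(sharpes, 5):.2f}")
```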

The third is using it as a predictor, and the sentiment analysis example that we have discussed is a good example of this. The interesting question is how long will it be before everyone is using it, and then there is no alpha left? Even for sentiment, there are many firms out there already that produce and sell their sentiment scores.

SZ: One interesting application⁶ of LLMs in the academic literature looks at exchange message data (such as a message about an order of a given quantity being placed at a given price), which is what exchanges use to build the limit order book. These exchange messages can be used as tokens to feed into an LLM, and then given a history (i.e., the start of a “sentence”), the LLM can then try to predict the subsequent tokens. This approach can be shown to exhibit a 5% correlation with the mid-price over 100 ticks, and hence can be used to predict into the future. It does this since it manages to learn many subtle nuances of price impact, and can be used to generate future continuations of data that can help with training reinforcement-learning algorithms that would otherwise struggle to model the reactions of other market participants.
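
The snippet below is a toy illustration of the general idea of turning exchange messages into discrete tokens, not the tokenisation scheme used in the cited paper; the messages, tick size and token format are invented.

```python
# Toy illustration: each exchange message becomes a short sequence of discrete tokens,
# and the message stream becomes a "sentence" an autoregressive model could continue.
messages = [
    {"type": "add",    "side": "bid", "price": 100.01, "size": 300},
    {"type": "add",    "side": "ask", "price": 100.03, "size": 200},
    {"type": "cancel", "side": "ask", "price": 100.03, "size": 200},
]

def tokenize(msg, ref_price=100.00, tick=0.01):
    price_ticks = round((msg["price"] - ref_price) / tick)  # price as an offset in ticks
    return [msg["type"], msg["side"], f"p{price_ticks:+d}", f"s{msg['size']}"]

sequence = [tok for msg in messages for tok in tokenize(msg)]
print(sequence)  # e.g. ['add', 'bid', 'p+1', 's300', 'add', 'ask', 'p+3', ...]
```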

Can LLMs come up with research ideas?

MS: Coming up with good ideas for research is quite hard. We hire juniors who are hardworking, smart, and they are often quite creative as they have not been entrenched, but by the same token they also have likely not read the entire back-catalogue of the Journal of Portfolio Management or the Journal of Finance – they are not actually exposed to that many ideas. If you then have an LLM that is trained on all financial literature, and you ask it to create some ideas, perhaps blending different things from different areas, you might end up with some speculative ideas, which you could then automate and test.

GB: Generally speaking, we can either hire more organic researchers, or we can create digital ones. The more digital ones we can build, the more cost-effective it is, and perhaps that is the way forward in offsetting the drop in research productivity⁷.

CH: There has been some evidence⁸ in the sciences literature that GPT-4 can be used to develop hypotheses. They can generate a whole range of ideas to test, and while most of them do not make much sense, there are some that are genuinely interesting statements to test. This approach could also be applied to creating and testing quantitative finance signals, and we do already see some researchers experimenting with this today.

NL: There may be this new concept of the “meta-researcher”, where people begin to learn how to marshal large groups of digital researchers to get that increase in productivity. You can imagine that LLMs at some point would be able to read the literature and produce a quantitative model that can do prediction – which is something that can then be backtested and filtered at scale.

With this approach, a key limitation comes back to this issue of compute. If everyone is doing this, and everyone wants a billion digital researchers that are programmatically querying LLMs, the computational costs would be extraordinary. The constraints would either come from model providers (throttling the rate of queries that can be made) or simply from cost.

JA: An additional method of idea generation in the discretionary world is to re-use the technology behind LLMs but, instead of training it on words, to embed assets as tokens and treat portfolios as “sentences” of these assets. Since there are relatively few assets compared to words, you are able to train much smaller models to essentially predict the missing trade in a given portfolio. If this training is done on good PMs, we can try to get it to suggest stocks that the PM might be missing, and the PM can then very easily check and review these suggestions. Given we have this huge dataset of the trading history of our human PMs, maybe using this data with open-source models can give us an edge.
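
As a toy illustration of the data setup described here, the sketch below treats portfolios as unordered "sentences" of asset tokens and suggests a missing holding. Simple co-occurrence counts stand in for the small model that would actually be trained; the portfolios and tickers are made up.

```python
from collections import Counter
from itertools import combinations

# Toy sketch of the "portfolios as sentences of asset tokens" idea: suggest the
# asset most often held alongside the current book.
portfolios = [
    ["AAA", "BBB", "CCC"],
    ["AAA", "BBB", "DDD"],
    ["BBB", "CCC", "DDD"],
]

cooccur = Counter()
for p in portfolios:
    for a, b in combinations(sorted(p), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def suggest_missing(held, universe=("AAA", "BBB", "CCC", "DDD")):
    scores = {c: sum(cooccur[(c, h)] for h in held) for c in universe if c not in held}
    return max(scores, key=scores.get)  # candidate most consistent with current holdings

print(suggest_missing(["AAA", "CCC"]))  # suggests the "missing" position, e.g. 'BBB'
```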

In the more distant future, it is not too absurd of an idea to consider the situation where a PM with a stellar track record is retiring, and the transition to their successors is made easier if LLMs can be trained to replicate some of their trading strategy and thought processes.

NL: This relates to a more general idea of creating a set of systematic strategies, and then asking the LLM to describe why a systematic strategy is making a given decision. The discretionary manager can then digest that information, and decide if they want to incorporate it, or overrule/ignore it. It comes down to this text-based interface between the machine (in this case the systematic system) and the human, that discretionary managers can leverage.

If LLMs are used in making investment decisions, are those decisions going to be interpretable?

NB: Even if an LLM is successfully making some predictions, it is difficult to really understand why it is making those predictions. For traditional quantitative strategies, there are usually some good candidate reasons for outperformance that we can talk about (such as loading on risks, exploiting psychological biases, etc.), but here it is not simple to understand the rationale behind the predictions, and that would worry me as an investor.

MS: With something like ChatGPT, we can obviously ask it directly to justify its reasoning, and it will be able to generate a justification for why it thinks a given prediction may be true, based on its training data. However, whether this justification is actually credible or not is a different story.

DN: Even “thinks” is not really the right word here, as the model is just predicting the next token. You can trick it into outlining some intermediate steps as part of a logical thought process (using techniques such as chain-of-thought prompting), and this is perhaps closer to how we think, but fundamentally we still do not know where the answers really come from.

NL: I like a phrase from Konrad Lorenz, who said that thinking is acting in an imagined space. LLMs certainly do not do that – they are “thinking” in a very different way. An interesting thing we can do with LLMs however is that we are able to extract the full high-dimensional vectors that make up the inner workings of these LLMs. These are potentially useful pieces of information that can help us begin to understand how they arrive at their answers.

GB: Every strategy loses money and exhibits a drawdown at some point in time. When this happens, we need to justify and explain to the investor how this happened – we cannot simply say “the machine told me to do it”. Existing black-box machine learning techniques require ancillary metrics (such as Shapley values) and graphs to help explain how they arrive at their decisions. However, I am not sure there is anything like this yet for language models that can help explain their decision-making, and this becomes a key limitation for using these models widely in production.

DN: Perhaps what we care about at the end of the day is whether we can provide a logical and rational explanation for its conclusions. We can certainly ask something like ChatGPT for this kind of thought process and justification, and we know that it can generally provide a logical explanation. So maybe that’s all that we need – it does not really matter where exactly it came from.

Do you have any final, parting thoughts?

CH: I think it is an amazing time to be living through this disruption. I was teaching a course in January when this all hit, and I abandoned the course syllabus to talk about Generative AI: I told the students that they need to think very deeply about what is happening in order to understand the implications that this could have for their career path. As with any technological disruption, there will be risks, and we need to identify these risks and try to manage them. However, I think it is a fool’s errand to try to halt technological progress, especially when Generative AI presents such a massive opportunity across all industries. The economic impact on the broader economy is happening a lot faster than with previous technological innovations, and that makes this time different. Historically, this kind of disruption has dislocated people but has not really led to job loss, since other jobs get created. In this case, the disrupted workers may have trouble finding a new job. This is not the horse-and-carriage driver switching to a motorised vehicle or a taxi driver switching to Uber. The gains in productivity will also have a different sort of effect, as people will on average work less. The hours that people work in a week in the US have historically come down from 80 hours over a century ago to 34 today, and the productivity increases with this technology may take this down even further.

Eventually, I believe we do need to seize the AI opportunity. The risk is great, but there are vast opportunities. For a company like Man Group, I think this presents many potential opportunities. It would be a big mistake for any company to ignore this space.

NB: From my perspective, I am excited about the possible connections between these developments in AI and behavioural economics. Behavioural economics cares a lot about how the brain works and its cognitive biases, and has often gained inspiration and ideas from algorithms developed by computer scientists. A good example is reinforcement learning, which is a powerful algorithm used by computer scientists, but which has proved to be of great interest to cognitive scientists as well, as it is highly related to how humans learn. Over the decades, cognitive science has learned a lot from computer science and vice versa, and with the advent of this new technology, I feel like the two fields are going to be significantly more connected.

NL: From a practical viewpoint, I think it is interesting to stand in the global south, in a continent like Africa, and think about the capabilities and challenges of this technology from that perspective. Global supply chain disruptions tend to affect those on the margins, and many people in developing countries who have moved into economic niches (such as call centres and low-cost programming in India) are going to be the most disrupted. There are, however, many opportunities as well. Places like East Africa have very little access to professions like doctors, lawyers, accountants, and software engineers, and so the nature of the potential disruption of this technology there is very different.

More philosophically, this rise in Generative AI gives us a new place to stand and look back at our own intelligence. One reason why we are so interested in artificial intelligence in general is because we are somehow fundamentally narcissistic about our own intelligence. We have an almost pre-Copernican view that our intelligence is the centre of the universe, and this technology is shifting that view, showing us that it is not. I find that very exciting.

 

1. In 2012, Krizhevsky, Sutskever, and Hinton achieved the first human-comparable results in image recognition with a GPU-trained Convolutional Neural Network (CNN) on the ImageNet dataset (Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25).
2. Girotra, Karan, et al. "Ideas are dimes a dozen: Large language models for idea generation in innovation." Available at SSRN 4526071 (2023).
3. Horton, John J. "Large language models as simulated economic agents: What can we learn from homo silicus?" NBER Working Paper No. w31122. National Bureau of Economic Research, 2023.
4. Bybee, Leland. "Surveying Generative AI's Economic Expectations." arXiv preprint arXiv:2305.02823 (2023).
5. Bai, John Jianqiu, et al. "Executives vs. Chatbots: Unmasking Insights through Human-AI Differences in Earnings Conference Q&A." Available at SSRN 4480056 (2023).
6. Nagy, Peer, et al. "Generative AI for End-to-End Limit Order Book Modelling: A Token-Level Autoregressive Generative Model of Message Flow Using a Deep State Space Network." arXiv preprint arXiv:2309.00638 (2023).
7. Bloom, Nicholas, Charles I. Jones, John Van Reenen, and Michael Webb. "Are Ideas Getting Harder to Find?" American Economic Review 110, no. 4 (2020): 1104-44.
8. Park, Yang Jeong, et al. "Can ChatGPT be used to generate scientific hypotheses?" arXiv preprint arXiv:2304.12208 (2023).

 