Bloomberg’s announcement that it had built a ChatGPT-like large language model focused on finance created a bit of a stir.
“BloombergGPT AI may be the harbinger of the next wave of corporate AI,” Ethan Mollick, a professor at Wharton, tweeted.
He noted that building models is all about the training data, and that Bloomberg enjoyed the advantage of proprietary financial data as well as general information scraped from the Web.
Reading the Bloomberg research paper provides some insight into the strange terrain where we find ourselves.
Among other things, Bloomberg used a data set called “Enron Emails.” The emails — for those not familiar — are a cache of 500,000 messages sent by about 150 senior executives at the energy trading company. The emails were made public by the government during an investigation following the company’s 2001 bankruptcy.
If you haven’t worked with machine learning, it would seem like a strange thing to include.
If you have, you understand that AI is all about big data and that such large troves of freely available data in the public domain are exceedingly rare. Hence, the inclination to use anything and everything you can find.
Bloomberg also used some predictable sources such as Wikipedia and PubMed Abstracts, as well as a lesser-known data set called “books3,” which is a corpus of text from 196,000 books cleaned up and published by an AI engineer named Shawn Presser.
The magic in Bloomberg’s new model — and the reason it garnered attention — was that it focused on finance and included text that is not publicly available, specifically decades of coverage by Bloomberg News.
Why proprietary data matters so much was perhaps best explained by Chamath Palihapitiya, the CEO of Social Capital, who noted in a recent interview that if Google, Facebook and Microsoft all used the same inputs, their AI outputs would be similar.
“If you have one extra thing, one little ingredient that the other companies don’t have, your output can be remarkably different,” he said.
Chamath predicted that companies with valuable data will increasingly hoard it. They may sell the data or even become acquisition targets because the data is so valuable.
If training data is so crucial, I would expect more legal challenges as well as demands for transparency.
Many companies don’t realize their data is being scraped from the Web or used to train models. Some may object. OpenAI, the company that created ChatGPT, was vague about where it obtained much of its training data.
Companies with AI models infused with proprietary data will likely keep them private. Bloomberg, for example, is likely to use its models within its professional terminal product.
A fact glossed over in the Bloomberg press release was that the truly unique content in BloombergGPT (i.e. Bloomberg News) accounted for just 0.7 percent of the training data.
I’m not an engineer, so I cannot say if 0.7 percent is enough to meet Chamath’s hurdle of “one little ingredient.”
It’s hard to find data as valuable as Bloomberg News to train financial models, but not impossible.
Another commenter on Twitter noted: “Bloomberg is only unique in combining their access to financial data with reasonably talented data science teams. OpenAI could surely license similar datasets from other financial data providers – Thomson Reuters, FactSet etc.”
The announcement underscores four likely future trends in AI:
–AI models with unique industry data will be the most accurate and effective
–High-quality industry data is limited and will become more valuable
–Finely-tuned industry models may be unavailable to the general public
–Legal challenges may emerge from companies over training data
Great timing, Ted. The Bloomberg announcement was one most in finance were waiting for. That it took as long as it did was a slight surprise to me, simply because things are moving so fast. Obviously it is better to be prudent with an announcement like this, rather than rush it.
Some observations. Using these models is becoming cheaper and frictionless, so every app that needs an AI capability will have one.
The model architectures used to build LLMs are all similar, so it is not the model that adds the value, it is the training data, as you correctly pointed out.
Corporate training data, the treasure trove of decades of historical emails, files, and notes, will prove valuable to internal AI efforts. This will help decision-making, compliance, customer support, and a myriad of other functions. The gating factors for using this internal data are: 1) is it easy to ingest your firm’s historical data into a database well suited for training and recall? 2) is it easy to ingest new data on a daily basis (how taxing is this ingestion to automate, or to get your team to add to the database on an ad hoc basis)?
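The two gating factors above can be sketched concretely. What follows is a minimal, illustrative sketch (all names are hypothetical, not from any real product): a corpus store that supports a one-time bulk load of historical documents, incremental daily ingestion, and a crude keyword recall query standing in for real full-text or vector retrieval.

```python
import sqlite3

def create_store(path=":memory:"):
    # Hypothetical corpus store backed by SQLite; a real system would
    # use full-text or embedding-based indexing instead.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        " id INTEGER PRIMARY KEY,"
        " source TEXT,"   # e.g. 'email', 'note', 'file'
        " created TEXT,"  # ISO date, lets incremental loads resume by date
        " body TEXT)"
    )
    return conn

def ingest(conn, records):
    # Works for both the bulk historical load and the daily increment:
    # records is any iterable of (source, created, body) tuples.
    conn.executemany(
        "INSERT INTO docs (source, created, body) VALUES (?, ?, ?)", records
    )
    conn.commit()

def recall(conn, keyword):
    # Crude substring recall; stands in for the "recall" half of the
    # training-and-recall database the comment describes.
    cur = conn.execute(
        "SELECT source, created, body FROM docs WHERE body LIKE ? ORDER BY created",
        (f"%{keyword}%",),
    )
    return cur.fetchall()

# Usage: bulk-load history once, then add a day's worth of new material.
conn = create_store()
ingest(conn, [
    ("email", "2019-03-01", "Quarterly compliance review notes"),
    ("note",  "2020-07-15", "Customer support escalation summary"),
])
ingest(conn, [("email", "2024-05-02", "New compliance policy rollout")])
print(recall(conn, "compliance"))
```

The design point is that ingestion is the same cheap operation whether it runs over decades of history or one day's additions; the hard part in practice is the upstream extraction from email archives and file shares, which this sketch deliberately leaves out.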