Will All Large Language Models (LLMs) Become Large Media Models (LMMs)?
Extending representation and lighting up the dark data
Chemists are teaching GPT-4 to run experiments and to control robots that help carry out chemical experiments. (Source: New Science). As Philippe Schwaller at the Swiss Federal Institute of Technology in Lausanne notes: “They [AI models] are not really good at representing molecules.” So the team taught ChemCrow with extensive libraries of molecules, chemical reactions, research papers, and more. They then asked it to devise a cost-effective and practical process to synthesize atorvastatin, a cholesterol-lowering drug, and it did a great job.
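To make the tool-augmented-LLM pattern behind systems like ChemCrow concrete, here is a minimal sketch, not ChemCrow’s actual code: the language model plans, requests a named domain tool, and folds the tool’s output back into its context before answering. The tool and model functions (lookup_molecule, search_papers, call_llm) are hypothetical stand-ins.

```python
# A minimal sketch of the tool-augmented-LLM pattern: the model asks for a
# domain tool by name, and the tool's result is appended to its context.
from typing import Callable, Dict

def lookup_molecule(name: str) -> str:
    """Stand-in for a chemistry database lookup (structures, known reactions)."""
    return f"[structure and known reactions for {name}]"

def search_papers(query: str) -> str:
    """Stand-in for a literature-search tool."""
    return f"[abstracts of papers matching '{query}']"

TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_molecule": lookup_molecule,
    "search_papers": search_papers,
}

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., GPT-4 via an API)."""
    if "->" in prompt:                       # tool results already present
        return "Here is a candidate synthesis route (sketch output)."
    return "lookup_molecule: atorvastatin"   # otherwise, request a tool

def answer_with_tools(question: str, max_steps: int = 3) -> str:
    context = question
    for _ in range(max_steps):
        reply = call_llm(context)
        if ":" in reply and reply.split(":", 1)[0] in TOOLS:
            tool_name, arg = reply.split(":", 1)
            # Append the tool's output so the model can reason over it next turn.
            context += f"\n{tool_name}({arg.strip()}) -> {TOOLS[tool_name](arg.strip())}"
        else:
            return reply                     # the model produced a final answer
    return call_llm(context + "\nGive your final answer.")

print(answer_with_tools("Propose a practical synthesis route for atorvastatin."))
```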
I believe this extension of representation, along with new data, will continue to expand the use of generative AI into more and more fields. The n-dimensional semantics of large language models seem able to expand to include not just languages but also numbers, sounds, and other media. In short, they will tend toward Large Media Models as new data get “lit up” for analysis.
One of my doctoral thesis advisors in 1986 was the late Dr. Patrick Henry Winston, III, who, at the time, was Director of MIT’s AI Lab. He once said to me, “Some of the biggest advances in AI are due to enhancements in representation, not just analysis.” I think the ChemCrow example is just such an extension of representation, one that raises our level of knowledge. It is because molecules, test results, and the like were added to its representation that the LLM (LMM?) can analyze the problem and produce useful synthesis steps.
Prof. Jim Glass and his team at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) are building models that learn sound, images, text, meaning, and other dimensions all at once. They simply ingest media in its full form and learn topics, participants, story, intent, tone, and so on.
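For readers who want a picture of how such cross-modal learning can work, here is a minimal sketch, not the CSAIL group’s actual approach: separate encoders map audio and text into one shared embedding space, and a contrastive objective pulls matching pairs together. The feature dimensions and encoders are toy stand-ins.

```python
# A minimal sketch of joint cross-modal embedding with a contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy per-modality encoder; real systems use spectrogram/transformer stacks."""
    def __init__(self, in_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, embed_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-length embeddings

audio_enc, text_enc = Encoder(in_dim=80), Encoder(in_dim=300)

# Stand-in batch: 16 paired clips (80-dim audio features, 300-dim text features).
audio_feats, text_feats = torch.randn(16, 80), torch.randn(16, 300)

a, t = audio_enc(audio_feats), text_enc(text_feats)
logits = a @ t.T / 0.07              # similarity of every audio/text pair
labels = torch.arange(16)            # the i-th audio clip matches the i-th text
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
loss.backward()                      # gradients pull matched pairs together
```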
The beauty of LLMs/LMMs is that they enable cross-media learning across words, images, numbers, and sounds. This will enable entirely new kinds of automation: making a movie or cartoon, for example, can be automated from front to back. Presentations. Education. I happen to believe that people will use these tools and will often blend a human element with the AI-generated content. In any case, tools that are now separate, such as text-to-text, text-to-image, and text-to-sound, will continue to evolve into integrated, fast multimedia models.
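As a rough illustration of that “front to back” idea, here is a minimal sketch of chaining today’s separate generators (story to script to frames to narration) into one pipeline. All three generator functions are hypothetical stand-ins for whatever text, image, and speech models one might use; an integrated LMM would collapse these stages into a single model.

```python
# A minimal sketch of chaining separate generators into one pipeline:
# premise -> scene descriptions -> still frames -> narration.

def generate_script(premise: str) -> list[str]:
    """Stand-in for a text model that expands a premise into scene descriptions."""
    return [f"Scene {i + 1}: {premise}" for i in range(3)]

def render_frame(scene: str) -> bytes:
    """Stand-in for a text-to-image model returning encoded image bytes."""
    return scene.encode()

def narrate(scene: str) -> bytes:
    """Stand-in for a text-to-speech model returning encoded audio bytes."""
    return scene.encode()

def make_cartoon(premise: str) -> list[dict]:
    """Run the whole pipeline front to back, one scene at a time."""
    return [{"scene": s, "frame": render_frame(s), "audio": narrate(s)}
            for s in generate_script(premise)]

storyboard = make_cartoon("A robot chemist learns to cook")
print(len(storyboard), "scenes generated")
```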
Why Care?
There are two key things to care about here. First, the ever more powerful symbol chipper we call LLMs will munch down symbols of all types, leading to more understanding and more incredible output, produced with unbelievable productivity by those bit-carpenters who learn these new power tools. Second, the semantic model underneath these LLMs seems incredibly flexible at integrating what were traditionally separate media and representations of a problem, which, to me, means we will find more and more of the world computable at a faster and faster pace. We will be covering this idea in more depth at our upcoming Generative AI World Conference, September 12 & 13 in Boston, MA. Come join us.