Multimodal neural networks are advanced AI models capable of interpreting and integrating multiple types of information at once—such as text, images, and audio—allowing them to form a richer, more accurate representation of reality. This capability makes the technology highly relevant for military applications, but it also raises significant ethical concerns.
Artificial intelligence (AI) is developing at an extraordinary pace, and multimodal models, a type of deep neural network, have emerged as a prominent topic in both commercial and defense contexts. To summarize recent research progress, scientists at FOI (the Swedish Defence Research Agency) have published the report Introduction to Multimodal Models, which provides an accessible overview of the state of the field.
The appeal of multimodal neural networks lies in their ability to process and relate different forms of data simultaneously. For example, such a network can recognize that an image of a tank corresponds to a textual description of a tank, explains Edward Tjörnhammar, a researcher at FOI and one of the report’s authors.
“The strength of multimodal models is their capacity to evaluate multiple input types at the same time,” he says. “By combining different contextual data, they can construct a more precise representation of the world. That in turn enhances their ability to understand and interact with complex situations where multiple senses or information streams are involved.”
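As a rough illustration of how an image and a description can be matched, one common approach is to map both into a shared vector space and compare them there. The sketch below is not taken from the FOI report: the embedding vectors are hand-written, hypothetical values standing in for what a trained model would produce, and the names `IMAGE_EMBEDDINGS`, `TEXT_EMBEDDINGS`, and `best_caption` are invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings. A real multimodal model would
# learn these vectors with an image encoder and a text encoder.
IMAGE_EMBEDDINGS = {
    "tank_photo": [0.9, 0.1, 0.2],
}
TEXT_EMBEDDINGS = {
    "a tank in a field": [0.8, 0.2, 0.1],
    "a fishing boat at sea": [0.1, 0.9, 0.3],
    "a bicycle on a street": [0.2, 0.1, 0.9],
}


def best_caption(image_key):
    """Pick the text whose embedding lies closest to the image embedding."""
    image_vec = IMAGE_EMBEDDINGS[image_key]
    return max(
        TEXT_EMBEDDINGS,
        key=lambda caption: cosine_similarity(image_vec, TEXT_EMBEDDINGS[caption]),
    )


if __name__ == "__main__":
    print(best_caption("tank_photo"))
```

With these toy values the tank photo lands closest to the tank description, because the two vectors point in nearly the same direction. In practice the hard part is training the encoders so that related images and texts end up close together.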
Humans operate multimodally
There are already commercial AI services that generate text, music, or visual art from a text prompt. By operating multimodally, an AI system or robot can behave in ways that resemble human perception and decision-making more closely.
“Humans function multimodally by instinctively combining senses like sight, hearing, touch, and balance,” says Edward Tjörnhammar. “Today, for example, there are multimodal industrial robots that perform complex autonomous tasks in controlled environments. However, translating that capability to military use is a major leap.”
He points to Boston Dynamics as an example: the company develops robotic “dogs” used in industry and construction to move objects or operate in hazardous areas where people should not go. The fire department in New York has also acquired two such robots. Although these robots are designed for complex environments, they do not currently rely on multimodal deep neural networks.
“If you want a robot similar to those models to traverse unknown terrain on a battlefield, the challenge becomes much harder,” Tjörnhammar notes. “The robot must interpret terrain, topology, and environmental cues to make autonomous decisions based on that context. A multimodal model could perform this, but today such models remain too large and resource-intensive for many real-time, field-deployable applications.”
From commercial to military uses
It is common for defense organizations to examine commercial solutions and adapt them for military needs. Edward Tjörnhammar highlights several areas where multimodal AI is already being applied or is likely to be applied in military settings.
“Multimodal systems can, for instance, analyze satellite imagery, interpret battlefield audio, or process geolocation data—and combine these inputs to support real-time decision-making,” he explains, adding:
“But what happens when weapons systems begin to make their own decisions based on situational analysis? Or when information systems integrate an increasing number of autonomous decision points? Ultimately, these developments concern life-and-death ethical choices. More autonomy in the decision chain means fewer moral decisions made directly by humans.”
One of the greatest concerns about AI—especially in military contexts—is the risk of misuse.
“You can imagine a future where a single social media post could trigger a missile strike on a residential building,” he warns.
Research advancing at breakneck speed
The scenario of a strike triggered by a social media post naturally evokes the targeted strikes carried out by Israel in Gaza. Tjörnhammar confirms that Israel is at the forefront of applying advanced AI and multimodal technologies for military purposes. Reports indicate that systems like Lavender and Gospel are used for automated target identification, generating prioritized attack lists that are often referred to as “kill lists.”
“We do not currently know whether Gospel or Lavender specifically employ multimodal models, because their exact capabilities are not publicly detailed,” he says. “Nevertheless, research is progressing at a staggering rate, driven not only by large technology companies that invest heavily in AI but also by national defense agencies.”
The FOI report emphasizes that multimodal models—one component of what is commonly referred to as AI—are likely to have a substantial impact on everyday life as well as on the future shape of defense organizations.
“So long as democratic debate remains active, AI can benefit all of us, but we must stay vigilant,” Tjörnhammar concludes. “AI has enormous potential in defense, but we should not allow it to diminish the number of moral and ethical decisions that ought to be made across military organizations.”
Read the report
Introduction to Multimodal Models