8.18.2025

Leading the coexistence with true AI: Multi-modal technology and patents

 

 Samsung Electronics recently made headlines by announcing its plan to build a "personalized AI ecosystem" that reflects user habits, centering on "multimodal AI," which unifies and recognizes various forms of information, such as voice and visual data.

 Samsung's ultimate goal is to implement a "customized AI assistant" that understands the user's intent and context through all electronic devices and peripherals within Samsung's AI ecosystem, not just AI confined to smartphones. Today, let's explore the technologies hidden within the AI assistants that are becoming a part of our daily lives.

The new Galaxy Z Fold7, equipped with multimodal AI features
(source: Samsung Newsroom, https://bit.ly/3GzlsWc)

"Who did you call?" – Interaction between voice command devices and AI assistants

With the advancement of technology, the number of voice recognition-enabled devices is steadily increasing, and the importance of "intelligent assistants" is growing along with it. In the past, voice recognition could not handle commands it had not been trained on, making it little more than a "button you press with your voice." Today, however, it exhibits sophisticated and advanced recognition capabilities that allow it to converse with people.

wipsglobal.com
KR10-2023-7034829, Multi-modal interaction with intelligent assistants in voice command devices

 Samsung Electronics recently registered a patent titled "Multi-Modal interaction with intelligent assistants in voice command devices." This technology is applicable to a wider range of electronic devices, including Samsung's home appliances and cars, in addition to devices belonging to the "Galaxy ecosystem" such as smartphones, tablets, and watches manufactured by Samsung.

 In an environment with multiple voice recognition-enabled devices, this technology identifies which device the user used to call the intelligent assistant, and then determines how the assistant should interact on that particular device.

 Considering the current trend where multiple devices are aiming for integration into a single "intelligent assistant," this technology can enable AI to perform commands accurately and intelligently. As a secondary benefit, it can also prevent the so-called "chorus phenomenon," where all surrounding devices respond simultaneously when the smartphone's AI assistant is called.
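To make the idea concrete, here is a minimal sketch of device arbitration in Python. It is an invented illustration, not the method claimed in the patent: each device that hears the wake word reports a detection event, and a simple rule picks a single device to respond while the rest stay silent, avoiding the "chorus phenomenon."

```python
from dataclasses import dataclass

@dataclass
class WakeEvent:
    device_id: str         # e.g. "phone", "tv", "fridge"
    confidence: float      # wake-word detection confidence (0-1)
    proximity: float       # estimated closeness of the speaker (0-1, higher = closer)

def pick_responding_device(events):
    """Choose one device to respond and suppress the others.

    Toy arbitration rule (confidence weighted by proximity),
    invented for illustration.
    """
    if not events:
        return None
    best = max(events, key=lambda e: e.confidence * e.proximity)
    return best.device_id

# All nearby devices heard the wake word, but only one answers:
events = [
    WakeEvent("tv", 0.80, 0.40),
    WakeEvent("phone", 0.90, 0.95),
    WakeEvent("fridge", 0.70, 0.30),
]
print(pick_responding_device(events))  # -> phone
```

In a real system the scoring signal would come from acoustic features and device state rather than two hand-set numbers, but the arbitration pattern is the same: many listeners, one responder.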

"How Multimodal AI Becomes Smarter" – Multimodal Data Learning Method

Multimodal AI often requires the ability to process complex information compared to simpler, traditional AI methods (e.g., text-to-text AI). This is because it needs to simultaneously recognize information from two or more different domains, find correlations between the data, and provide appropriate answers.

wipsglobal.com
KR10-2018-0029403, Method and device for learning multimodal data

 According to a patent disclosed by Samsung, the data learning method of multimodal AI processes different signals in their respective networks, grasps the context of each signal, analyzes the correlations between them, and is trained to deduce the user's hidden intent.

 Let's imagine showing a wardrobe containing a dress and a tracksuit and saying, "Pick an outfit for a party." The AI understands the contexts: "something in the wardrobe needs to be found that meets the conditions" and "an outfit for a party is needed." It then analyzes their correlation to deduce the intent: "the user wants to find a formal outfit for a party." Based on this deduced intent, the AI concludes, "A dress would be good."
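The wardrobe example above can be sketched in a few lines of Python. This is a toy rendering of the two-stage idea only: modality-specific encoders (stand-ins for real vision and language networks), followed by a fusion step that correlates the encoded signals to deduce an intent. The feature names and rules are invented for illustration.

```python
def encode_image(objects):
    # stand-in for a vision network: tag the detected wardrobe items
    return {"has_formal": "dress" in objects,
            "has_casual": "tracksuit" in objects}

def encode_text(command):
    # stand-in for a language network: extract the occasion from the command
    return {"occasion": "formal" if "party" in command else "casual"}

def fuse_and_infer(img_feats, txt_feats):
    # correlate the two modalities to deduce the hidden intent
    if txt_feats["occasion"] == "formal" and img_feats["has_formal"]:
        return "suggest the dress"
    if img_feats["has_casual"]:
        return "suggest the tracksuit"
    return "no suitable outfit found"

answer = fuse_and_infer(encode_image({"dress", "tracksuit"}),
                        encode_text("Pick an outfit for a party"))
print(answer)  # -> suggest the dress
```

In an actual multimodal model the encoders would produce learned embeddings and the fusion step would be trained end to end, but the pipeline shape (encode per modality, then correlate) mirrors the method described in the patent.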

 The learning method of multimodal AI aims to equip AI with broader problem-solving capabilities, enabling it to be used in any situation by cultivating its ability to understand the context of multi-party information and extract hidden information.

"Multimodal AI that Sees Like a Real Person?" – Multi-Object Visual Search Technology

As mentioned earlier, the ultimate goal of multimodal AI is an "intelligent assistant that understands intent and context and executes commands accurately." Compared to conventional AI, it has advanced to the point of processing information while sharing the current situation, much like conversing with a friend. Yet there is still a long way to go, since it needs prior training just to know where to look for the visual information required to execute a given command.

wipsglobal.com
KR10-2023-0104283, Multi-modal deep learning model for multi-object visual search task

 However, Korea University's patent, "Multi-modal deep learning model for multi-object visual search task", might enable the implementation of sophisticated AI when combined with multimodal AI. This patent concerns a method of predicting the user's intended focus based on information about objects related to a given image, by simultaneously utilizing text and image analysis technologies.

 For example, let's assume the command is "Find the oven and refrigerator in this kitchen." Conventional methods might take an inefficient focus path, such as scanning the ceiling first and then moving to other areas if no focus path is specified. However, if this model is applied, it can extract features like "located on the floor or wall" from the semantic properties of "refrigerator" and "oven," allowing it to adopt an efficient focus path, such as scanning the walls or floor first and then moving to their surrounding areas.
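The kitchen example above can be illustrated with a small Python sketch. Assuming (hypothetically) that each object class carries a semantic prior over scene regions, the combined priors determine an efficient scan order; the probability values below are made up for illustration and are not from the patent.

```python
# Made-up semantic priors: how likely each object is to appear in each region.
SEMANTIC_PRIORS = {
    "oven":         {"floor": 0.60, "wall": 0.35, "ceiling": 0.05},
    "refrigerator": {"floor": 0.70, "wall": 0.25, "ceiling": 0.05},
}

def focus_path(targets):
    """Order scene regions by their combined likelihood of containing the targets."""
    regions = {"floor": 0.0, "wall": 0.0, "ceiling": 0.0}
    for target in targets:
        for region, p in SEMANTIC_PRIORS[target].items():
            regions[region] += p
    # scan the most promising regions first
    return sorted(regions, key=regions.get, reverse=True)

print(focus_path(["oven", "refrigerator"]))  # -> ['floor', 'wall', 'ceiling']
```

The point of the sketch is the ordering: a prior-free search might start at the ceiling, while semantic properties of "oven" and "refrigerator" steer attention to the floor and walls first.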

 If such technology is combined with real-time multimodal AI services like Google's "Gemini Live," it could facilitate the implementation of AI that truly "sees and speaks" like a human, by quickly processing information transmitted from the user.

Solving "I couldn't find a suitable answer for OO" – Proper Noun Learning Method

A drawback of traditional speech recognition AI was its inability to correctly recognize user utterances if the words were not pre-learned (e.g., Samsung's Bixby misinterpreting "Huoguo" as "capture"), or its tendency to inaccurately deduce words, leading to incorrect answers. These shortcomings can be a fatal weakness in the recent trend aiming for AI that naturally communicates with humans.

wipsglobal.com
KR10-2011-0079586, Apparatus and method for adding new proper nouns to language model in a continuous speech recognizing system

 This patent, filed by ETRI, describes a technology that, upon encountering a proper noun absent from the existing language model, compares it against existing training data to screen suitable sentence candidates, thereby acquiring information about the new proper noun. When integrated with multimodal AI, this would let the AI learn new proper nouns by combining them with other information whenever it encounters unfamiliar objects or unheard words. Through this, we can expect more natural interaction from AI.
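A rough sketch of the idea in Python: when an out-of-vocabulary word appears, the utterance is turned into a slotted frame and matched against known sentence patterns, so the new word can be registered under the usage that fits. The sentence frames and matching rule below are invented for illustration and are far simpler than the patented method.

```python
# Known sentence patterns from the existing language model (invented examples).
KNOWN_SENTENCES = [
    "play music by <ARTIST>",
    "call <PERSON> on the phone",
    "navigate to <PLACE>",
]

def frame_of(utterance, unknown_word):
    # replace the unknown proper noun with a slot so the frame can be matched
    return utterance.replace(unknown_word, "<SLOT>")

def matching_frames(utterance, unknown_word):
    """Screen known sentence candidates that fit the utterance's frame."""
    frame = frame_of(utterance, unknown_word)
    hits = []
    for sentence in KNOWN_SENTENCES:
        # crude match: compare the words preceding the slot
        prefix = sentence.split("<")[0]
        if frame.startswith(prefix):
            hits.append(sentence)
    return hits

print(matching_frames("call Huoguo on the phone", "Huoguo"))
# -> ['call <PERSON> on the phone']
```

Here the surviving candidate tells the system that "Huoguo" is being used as a person-like name in this utterance, which is the kind of contextual evidence the patent uses to add the new word to the language model.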

Multimodal Technology: Accelerating True Coexistence with AI

 Today, we've explored multimodal AI technology. As AI development rapidly advances, its presence in our daily lives is also increasing. The development of multimodal AI not only makes our lives more convenient but also allows us to dream of coexistence with AI, making these technologies worthy of our attention.







