![]()
New Galaxy Fold7 equipped with multi-modal AI features (source: Samsung Newsroom, https://bit.ly/3GzlsWc)
"Who did you call?" – Interaction between voice command devices and AI assistants
With the advancement of technology, the number of voice recognition-enabled devices is steadily increasing, and the importance of "intelligent assistants" is growing along with it. In the past, voice recognition amounted to little more than "pressing a button with your mouth," since it could not recognize commands it had not been pre-trained on. Today, however, its recognition capabilities are sophisticated enough to hold a conversation with people.
![]()
wipsglobal.com KR10-2023-7034829, Multi-modal interaction with intelligent assistants in voice command devices
Samsung Electronics recently registered a patent titled "Multi-modal interaction with intelligent assistants in voice command devices." This technology applies not only to devices in Samsung's "Galaxy ecosystem," such as smartphones, tablets, and watches, but also to a wider range of electronics, including Samsung's home appliances and cars.
In an environment with multiple voice recognition-enabled devices, this technology identifies which device the user used to call the intelligent assistant, and then determines how the assistant should interact on that particular device.
Considering the current trend where multiple devices are aiming for integration into a single "intelligent assistant," this technology can enable AI to perform commands accurately and intelligently. As a secondary benefit, it can also prevent the so-called "chorus phenomenon," where all surrounding devices respond simultaneously when the smartphone's AI assistant is called.
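The arbitration idea can be sketched in a few lines. This is a minimal illustration, not the method claimed in the patent: assume each nearby device reports a wake-word confidence score, and only the best-scoring device is allowed to respond.

```python
# Hypothetical multi-device wake-word arbitration (illustrative sketch,
# not Samsung's patented method): each device reports how confidently it
# heard the wake word, and only the best-scoring device answers, which
# suppresses the "chorus phenomenon" of every device replying at once.

from dataclasses import dataclass

@dataclass
class WakeReport:
    device: str        # e.g. "phone", "watch", "tv"
    confidence: float  # wake-word detection score in [0.0, 1.0]

def arbitrate(reports):
    """Pick the single device that should answer; the rest stay silent."""
    if not reports:
        return None
    winner = max(reports, key=lambda r: r.confidence)
    return winner.device

reports = [
    WakeReport("phone", 0.91),
    WakeReport("watch", 0.74),
    WakeReport("tv", 0.55),
]
print(arbitrate(reports))  # -> phone
```

In practice the score would come from each device's wake-word detector (and could factor in proximity or which device the user is looking at), but the selection step reduces to exactly this kind of comparison.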
"How Multimodal AI Becomes Smarter" – Multimodal Data Learning Method
Multimodal AI must process more complex information than simpler, traditional AI methods (e.g., text-to-text models): it needs to simultaneously recognize information from two or more different domains, find correlations between the data, and provide appropriate answers.
![]()
wipsglobal.com KR10-2018-0029403, Method and device for learning multimodal data
According to a patent disclosed by Samsung, the multimodal data learning method processes the different signals in their respective networks, understands the context of each signal, analyzes the correlations between them, and trains the model to infer the user's hidden intent.
Let's imagine showing the AI a wardrobe containing a dress and workout clothes and saying, "Pick an outfit for a party." The AI understands the contexts: "something in the wardrobe must be found that meets the conditions" and "an outfit for a party is needed." It then analyzes their correlation to infer the intent: "the user wants a formal outfit for a party." Based on this inferred intent, the AI concludes, "A dress would be good."
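The encode-separately-then-fuse pipeline behind this example can be sketched as toy code. Everything here is illustrative: the "encoders" are hand-made feature extractors, the fusion is simple concatenation, and the weights stand in for parameters a real network would learn.

```python
# Toy sketch of the multimodal learning pipeline (illustrative only; the
# encoders, fusion rule, and weights below are made up): each modality is
# encoded in its own "network," the embeddings are fused, and the fused
# vector is scored against a candidate intent.

def encode_text(text):
    # Hypothetical text encoder: bag-of-words counts over a tiny vocabulary.
    vocab = ["party", "outfit", "wardrobe", "formal"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def encode_image(objects):
    # Hypothetical vision encoder: indicator features for detected objects.
    known = ["dress", "workout_clothes", "wardrobe"]
    return [1.0 if o in objects else 0.0 for o in known]

def fuse(text_vec, image_vec):
    # Late fusion by concatenation; a real model learns this correlation.
    return text_vec + image_vec

def score_intent(fused, weights):
    # Dot product of fused features against one intent's weight vector.
    return sum(f * w for f, w in zip(fused, weights))

text_vec = encode_text("pick an outfit for a party")
image_vec = encode_image({"dress", "workout_clothes", "wardrobe"})
fused = fuse(text_vec, image_vec)

# Hand-set weights standing in for trained parameters: "party" in the text
# plus "dress" in the image push toward the formal-outfit intent, while
# workout clothes push against it.
formal_weights = [2.0, 1.0, 0.5, 1.0, 2.0, -1.0, 0.5]
print(score_intent(fused, formal_weights))  # -> 4.5
```

A trained system would learn the weights from data and score many intents at once, but the structure — separate per-modality encoders feeding a joint representation — is the same.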
"Multimodal AI that Sees Like a Real Person?" – Multi-Object Visual Search Technology
As mentioned earlier, the ultimate goal of multimodal AI is an "intelligent assistant that understands intent and context and executes commands accurately." Although it has advanced, compared to conventional AI, to the point of processing information while sharing the current situation, much like conversing with a friend, there is still a long way to go: it requires prior learning to know where to find the visual information needed to execute a given command.
![]()
wipsglobal.com KR10-2023-0104283, Multi-modal deep learning model for multi-object visual search task
However, Korea University's patent, "Multi-modal deep learning model for multi-object visual search task," could enable more sophisticated AI when combined with multimodal AI. The patent concerns a method of predicting which object the user intends to focus on, based on information about the objects in a given image, by using text and image analysis technologies simultaneously.
Solving "I couldn't find a suitable answer for OO" – Proper Noun Learning Method
A drawback of traditional speech recognition AI was its inability to recognize user utterances containing words it had not been pre-trained on (e.g., Samsung's Bixby misinterpreting "Huoguo" as "capture"), or its tendency to guess at such words inaccurately and give incorrect answers. These shortcomings can be a fatal weakness given the recent trend toward AI that communicates naturally with humans.
![]()
wipsglobal.com KR10-2011-0079586, Apparatus and method for adding new proper nouns to language model in a continuous speech recognizing system
This patent, filed by ETRI, describes a technology that, upon encountering a proper noun absent from the existing language model, compares it with existing training data to select plausible sentence candidates, thereby acquiring new information about the proper noun. Integrated with multimodal AI, this would let the AI learn new proper nouns by combining them with other information when it encounters unfamiliar objects or unheard words. Through this, we can expect more natural interaction from AI.
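The first decision in such a pipeline — is this token a mishearing of a known word, or a genuinely new proper noun? — can be sketched with a fuzzy match against the vocabulary. This is not ETRI's actual algorithm; the vocabulary, cutoff, and tagging scheme below are assumptions for illustration.

```python
# Illustrative sketch (not ETRI's patented method): when the recognizer
# produces a token outside its vocabulary, compare it against known
# words; if something is close enough, treat it as a mishearing and
# correct it, otherwise flag it as a new proper noun to be learned.

import difflib

VOCAB = {"call", "play", "capture", "open", "music"}

def classify_token(token, vocab=VOCAB, cutoff=0.8):
    """Return a close in-vocabulary match, or flag a new proper noun."""
    match = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return match[0] if match else f"<new-proper-noun:{token}>"

print(classify_token("captur"))  # close to a known word -> corrected
print(classify_token("Huoguo"))  # no close match -> flagged for learning
```

A flagged token would then be handed to the candidate-filtering step the patent describes (and, in a multimodal setting, grounded against what the camera sees), so the system learns the new noun instead of forcing it into a known word.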
Multimodal Technology: Accelerating True Coexistence with AI
Today, we've explored multimodal AI technology. As AI development rapidly advances, its presence in our daily lives is also increasing. The development of multimodal AI not only makes our lives more convenient but also lets us dream of coexistence with AI, making these technologies well worth our attention.