Facebook researchers this week introduced Situated Interactive MultiModal Conversations (SIMMC), a novel research direction aimed at training AI chatbots that take actions, such as showing an object and explaining what it’s made of, in response to images, memories of previous interactions, and individual requests. In a technical paper, they detail new data sets designed for this purpose, containing around 13,000 human-to-human dialogs across two domains (furniture and fashion), along with several tasks framed as objective evaluation protocols.
Facebook appears to be working toward an assistant that can process data it and a user co-observe, then respond with more than just plain text based on that data. The hope is that this assistant emulates human chat partners by responding to images, messages, and messages about images as naturally as a person might. For example, given the prompt “I want to buy some chairs — show me brown ones and tell me about the materials,” the assistant might reply with an image of brown chairs and the text “How do you like these? They have a solid brown color with a foam fitting.”
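To make that exchange more concrete, here is a rough sketch of how such a multimodal turn might be represented when structured for a model. The field names and item IDs below are illustrative assumptions, not the published SIMMC schema.

```python
# Hypothetical representation of one multimodal dialog turn.
# Field names and values are illustrative only; they do not reflect
# the actual SIMMC data format.
turn = {
    "user_utterance": "I want to buy some chairs — show me brown ones "
                      "and tell me about the materials.",
    "assistant_action": {
        "name": "SearchFurniture",               # structured action the assistant takes
        "arguments": {"category": "chair", "color": "brown"},
    },
    "assistant_response": {
        "text": "How do you like these? They have a solid brown color "
                "with a foam fitting.",
        "shown_items": ["chair_103", "chair_217"],  # objects surfaced to the user
    },
}

print(turn["assistant_action"]["name"])
print(turn["assistant_response"]["text"])
```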
SIMMC supports the development of such an assistant with the aforementioned data sets and new technical tasks, which address task-oriented dialogs encompassing multimodal user contexts in the form of a co-observed image or a virtual reality environment. These contexts are updated dynamically based on the dialog flow and the assistant’s actions.
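As a minimal sketch of what a dynamically updated, co-observed context might look like in code, the snippet below tracks which items the user and assistant are jointly viewing and records a memory of prior turns. All class, method, and field names here are hypothetical illustrations, not an API from the SIMMC release.

```python
# Minimal sketch of a co-observed multimodal context that is updated as the
# dialog flows. All class and field names are hypothetical illustrations.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MultimodalContext:
    """Objects the user and assistant currently co-observe, plus turn history."""
    visible_items: List[Dict] = field(default_factory=list)
    history: List[str] = field(default_factory=list)  # memory of previous turns

    def apply_action(self, action: str, items: List[Dict]) -> None:
        """Update the shared view after the assistant acts (e.g., shows new chairs)."""
        if action == "show":
            self.visible_items = items
        self.history.append(f"{action}: {[i['id'] for i in items]}")


ctx = MultimodalContext()
# Turn 1: the assistant surfaces brown chairs in response to the user's request.
ctx.apply_action("show", [{"id": "chair_103", "color": "brown", "material": "foam"}])
# Turn 2: a follow-up such as "tell me about the first one" is now grounded
# in whatever the shared context currently contains.
print(ctx.visible_items)
print(ctx.history)
```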