This year I had the pleasure of organizing, together with Hayley Hung and Ekin Gedik, a task for the MediaEval Benchmarking Initiative.

The task, called No-Audio Multimodal Speech Detection in Crowded Social Settings, aims to automatically estimate when a person seen in a video starts and stops speaking, using modalities other than audio in order to preserve privacy.

In contrast to conventional speech detection, no audio is used for this task. Instead, the automatic estimation systems submitted by participants must exploit the natural human movements that accompany speech (i.e., speaker gestures, as well as shifts in pose and proximity). More on the task can be found here.
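To make the idea concrete, here is a minimal sketch of a no-audio speaking-status baseline, assuming tri-axial accelerometer data from a wearable device and per-sample binary speaking labels. This is purely illustrative and not the task's official baseline; the filenames, window sizes, and feature choices are all hypothetical.

```python
# Minimal sketch: predict speaking/not-speaking from body movement alone.
# Assumes (T, 3) accelerometer data and (T,) 0/1 speaking labels; the
# .npy filenames below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

WIN, HOP = 60, 30  # window and hop size in samples (assumed values)

def window_features(accel, win=WIN, hop=HOP):
    """Slice the (T, 3) signal into windows and compute simple
    movement statistics (mean, std, energy per axis)."""
    feats = []
    for start in range(0, len(accel) - win + 1, hop):
        w = accel[start:start + win]
        feats.append(np.concatenate([w.mean(0), w.std(0), (w ** 2).mean(0)]))
    return np.array(feats)

accel = np.load("person01_accel.npy")      # hypothetical file, shape (T, 3)
labels = np.load("person01_speaking.npy")  # hypothetical file, shape (T,)

X = window_features(accel)
# A window counts as "speaking" if the majority of its samples are labeled 1.
y = np.array([labels[s:s + WIN].mean() > 0.5
              for s in range(0, len(labels) - WIN + 1, HOP)]).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

Even a simple linear model over movement statistics captures some signal here, since speaking tends to coincide with gesturing; participants are of course free to use far richer features and models.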

Looking forward to the first of many editions of this task at MediaEval.