This talk dives into FrameStory, a Python library for generating video descriptions, which are integral to making video content accessible to people with visual impairments.
It touches on aspects such as:
Image captioning using BLIP (Salesforce/blip-image-captioning-large) and its advantages over YOLO and CLIP, which are traditionally used for object detection and for semantic indexing and search of multimodal content, respectively
Extraction of significant frames using OpenCV by checking whether the difference between consecutive frames exceeds a threshold
Generation and de-duplication of captions for significant frames
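The extraction and de-duplication steps above can be sketched roughly as follows. This is an illustrative sketch, not FrameStory's actual code: `frame_difference` stands in for OpenCV's per-pixel difference (computed here on plain nested lists so the example is self-contained), `dedupe_captions` uses `difflib` similarity, and the threshold values are assumptions.

```python
from difflib import SequenceMatcher

def frame_difference(frame_a, frame_b):
    """Mean absolute pixel difference between two grayscale frames
    (frames as 2-D lists of 0-255 ints; stands in for cv2.absdiff)."""
    total = count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count

def significant_frames(frames, threshold=30.0):
    """Keep the first frame, then any frame whose difference from the
    previously kept frame exceeds the threshold."""
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        if frame_difference(kept[-1], frame) > threshold:
            kept.append(frame)
    return kept

def dedupe_captions(captions, similarity=0.85):
    """Drop captions that are near-duplicates of one already kept."""
    unique = []
    for caption in captions:
        if all(SequenceMatcher(None, caption, seen).ratio() < similarity
               for seen in unique):
            unique.append(caption)
    return unique

# Toy 2x2 "frames": a static shot followed by a scene change.
frames = [[[10, 10], [10, 10]],
          [[12, 11], [10, 10]],       # near-identical -> skipped
          [[200, 190], [180, 210]]]   # big change -> kept
print(len(significant_frames(frames)))  # 2
print(dedupe_captions(["a dog on grass", "a dog on the grass",
                       "a city street"]))
```

The same threshold idea carries over directly when the frames are real OpenCV/NumPy arrays; only the difference computation changes.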
Furthermore, the talk covers framestoryx, a fork of FrameStory I develop under FOSSIA, which extends the library to be compatible with modern Python tooling such as Poetry and uv for broader coverage.
This fork is used by TranscribeIt, a free-software multimedia accessibility application developed under FOSSIA.
Work on the fork is ongoing to support segmented descriptions, asynchronous I/O so that video downloads do not block other operations in asynchronous functions, and multilingual descriptions for improved accessibility.
Understand how image captioning with BLIP works in FrameStory
Understand how significant frames are extracted and why this makes video description generation efficient
See how framestoryx improves on FrameStory, and the scope for further improvements for the future of accessibility
I like to see FOSS being used for a11y