Lip Reading to Distinguish Phrases Using Squeezenet

Alaa Yasir

doi:10.47831/mjpas.v3i4.306

Keywords:

MTCNN, visual speech recognition, SqueezeNet

Abstract

The need for visual speech recognition is the main motivation for this research. Speech is difficult to understand from visual cues alone, so there is an urgent need for a system that can read lips and interpret visual speech. Such a system would help people with hearing impairments interpret lip movements and understand spoken phrases and sentences. It would also assist people in noisy environments such as stadiums, airports, and factories, where the audio signal is difficult to access. Researchers therefore continue to seek better solutions to this problem, which matters both for people with hearing disabilities and for anyone who struggles to follow spoken language in noisy surroundings.

To address this problem, this research designed and implemented a real-time system that visually interprets spoken phrases and sentences without audio input. The proposed system consists of two main stages. In the first stage, video of the speaker is captured and divided into consecutive frames; the face region and the mouth region are detected, and the region of interest (ROI), which is the lip region, is then detected and localized. The multi-task cascaded convolutional neural network (MTCNN) algorithm performs the detection and localization of these regions. In the second stage, the frames corresponding to the lip region are fed into the SqueezeNet convolutional network model, which recognizes the spoken phrases and sentences. The proposed method achieved an accuracy of 90%.
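
The abstract does not state how the lip region is derived from the detector output, so the following is only a minimal sketch of one plausible approach, not the author's method. It assumes the detector returns facial landmarks under the keys `mouth_left` and `mouth_right` (the key names used by the open-source `mtcnn` Python package) and builds a padded crop box around them. The function names `lip_roi` and `crop`, the padding factor, and the aspect ratio are all hypothetical, illustrative choices.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def lip_roi(keypoints: Dict[str, Point], frame_w: int, frame_h: int,
            pad_x: float = 0.25, aspect: float = 0.6) -> Box:
    """Return a lip crop box centred between the two mouth corners.

    pad_x  : extra width on each side, as a fraction of the mouth width (assumed value).
    aspect : box height divided by box width (assumed value).
    """
    (lx, ly) = keypoints["mouth_left"]
    (rx, ry) = keypoints["mouth_right"]

    # Centre of the mouth and distance between its corners.
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    mouth_w = abs(rx - lx)

    # Half-extents of the padded box.
    half_w = mouth_w * (0.5 + pad_x)
    half_h = half_w * aspect

    # Clamp the box so it never leaves the frame.
    left = max(0, int(round(cx - half_w)))
    top = max(0, int(round(cy - half_h)))
    right = min(frame_w, int(round(cx + half_w)))
    bottom = min(frame_h, int(round(cy + half_h)))
    return left, top, right, bottom


def crop(frame: List[list], box: Box) -> List[list]:
    """Crop a frame stored as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]
```

In a pipeline like the one described, each cropped lip frame would then be resized to the classifier's input size and passed to the SqueezeNet model; that step is omitted here because the abstract gives no details of the input size or training setup.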
