The importance and advantages of using multimedia as a learning tool are clear. Engaging not just one of our five senses but two or three is vital to how well we absorb information and how long it stays in our memories. Video is the most widely used form of multimedia for learning, combining sight and sound to engage our visual and auditory learning centres. Pairing certain soundtracks or speaking in a certain way, using different colour schemes or editing styles: variety can appeal to different audiences and help hammer home your points. But what if your audience is unable to access your media as ‘multi’? A blind person might not be able to see your visual storytelling or the smaller details that are shown rather than told, and a deaf person might not be able to glean the full lesson from those same visuals alone. Luckily, we can sidestep this with the use of described video (DV) and subtitling, respectively.
Link to: Sandy’s View – How do people who are blind or visually impaired watch TV and movies
If you watch traditional TV, then you have definitely heard the phrase “this program is available in described video, for the visually impaired”. But maybe you are unsure what it means. For a better sense, watch the YouTube video above, a PSA created for exactly that purpose: it shows how a person with low vision might experience a scene with and without DV, and the stark difference it can make. As someone who does not use DV and has no personal experience to speak from, I feel the topic is better explained by someone who does. In her article (linked above), Sandy talks about the ways DV is able to assist those who have low or no vision, as well as its shortcomings: how TV networks like PBS and FOX provide their audiences with the ability to ‘watch’ their shows, but Netflix isn’t quite there yet. I think this is an important addition to what we see as standard for multimedia. We already have auto-generated subtitles (even if they are not perfect), and with them we can share our multimedia not only with people who are hearing impaired but also with people who aren’t native speakers of the presented language. It is time we applied these AI techniques to DV as well, further expanding accessibility.
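(As a side note for the technically curious: auto-captioning of this kind is already easy to prototype. Below is a minimal sketch using the open-source openai-whisper Python package; the file name lecture.mp4 is just a placeholder I made up, and a real captioning workflow would still need human review for accuracy.)

```python
# A minimal sketch of auto-generated captions using the open-source
# openai-whisper package. Assumes `pip install openai-whisper` (plus ffmpeg)
# and a local file "lecture.mp4" (hypothetical example filename).
import whisper

model = whisper.load_model("base")          # small, general-purpose model
result = model.transcribe("lecture.mp4")    # extracts the audio and transcribes it

# Each segment has a start time, an end time, and text, which is exactly
# the information a subtitle file (e.g. SRT) needs.
for seg in result["segments"]:
    print(f"{seg['start']:.2f} --> {seg['end']:.2f}: {seg['text'].strip()}")
```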
Similar to AI subtitling, programs that provide ‘text-to-speech’ have become popular among students in recent years. I know this myself since, with my ADHD, reading large blocks of text is challenging. I find myself skipping lines or skimming the content, or in the case of scientific papers, reading only the first and last paragraph of a section. This can lead to missed information or misconstruing the context of what I am reading, and it often leaves me having to go back and re-read, which is inefficient time-wise. But our text-to-speech programs still aren’t perfect. Listening to a robotic voice monotonously reading out sentences is challenging enough before you even consider the complex way the English language is spoken. I have only recently come to appreciate this as I began my Japanese studies. Unlike English, Japanese is a phonetic language, meaning that the characters in its alphabet are always pronounced the same way. The words also carry specific ‘pitch accents’ (not as strict as Chinese tones, but similar), so it is easier to train an AI program to mimic natural speech patterns. In English, by contrast, letters are pronounced differently depending on how the word is built, so names and less common words are often mispronounced by an AI (for example, my own name is pronounced wrong by my Google Home, which says something like ‘meh-ee-ra’ instead of ‘mee-ra’ because of its spelling). And the pitch of our voices rises and falls differently depending on punctuation, the length of the sentence, the intended emotion, and where the words themselves sit in the sentence. These things are very hard to program into a computer without a very well-trained neural network that has access to immense amounts of natural speech. This is why I believe that ‘Vocaloids’ emerged first in Japan.
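To make that spelling-versus-pronunciation problem concrete, here is a small hedged sketch using pyttsx3, an offline text-to-speech library for Python. The phonetic respelling of my name is my own workaround, not an official fix, and the exact voice you hear depends on your operating system.

```python
# A small sketch of why English spelling trips up text-to-speech.
# Uses pyttsx3, an offline TTS wrapper (pip install pyttsx3); the voices
# available depend on the operating system, so output will vary.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)  # slow the speech down slightly

# The engine guesses pronunciation from spelling alone, so "Mira" can
# come out as something like "meh-ee-ra". Respelling the name
# phonetically is a crude but common workaround.
engine.say("Hello, Mira.")   # as written: pronunciation is the engine's guess
engine.say("Hello, Meera.")  # phonetic respelling: usually closer to 'mee-ra'
engine.runAndWait()
```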
Vocaloids are voice-synthesizing programs that people can use to add sung lyrics to music. The roughly 100 basic sounds of the Japanese language are recorded by an individual and stored in a sound bank, and those sounds can then be manipulated in other software to control pitch and other qualities, which allows the voice to be molded to fit the song or dialogue. These Japanese sound banks are also used by English-speaking musicians, but at first the lyrics were hard to understand, since trying to mimic English with only a set of 100 sounds is considerably difficult. Now, as Vocaloids become more popular, it has become easier to manipulate the sounds. For example, the song “Aura” by Ghost and Pals is ‘sung’ by a recently developed Vocaloid, Solaria, and it honestly sounds like a human voice with auto-tune, which is extremely impressive. If you are curious to compare with previous Vocaloids, visit Ghost’s YouTube channel and give some of their older works a listen (I would recommend “Candle Queen” as a middle example and “Novocain” as an early example, closer to what is commonly recognized as the Vocaloid sound); the evolution is truly interesting.
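If you are curious what “record the basic sounds, then bend their pitch” might look like in code, here is a toy sketch using the librosa and soundfile Python libraries. The sound-bank filenames and the little melody are made-up placeholders, and real Vocaloid engines are far more sophisticated (smoothing transitions between sounds, modelling vibrato, and so on), so treat this only as an illustration of the basic idea.

```python
# A toy sketch of the sound-bank idea behind Vocaloid-style synthesis:
# pre-recorded syllable clips are strung together and pitch-shifted to
# follow a melody. Filenames and pitches here are hypothetical examples.
import numpy as np
import librosa
import soundfile as sf

SR = 22050  # common audio sample rate

def load_mora(name):
    """Load one pre-recorded syllable (mora) from the sound bank."""
    y, _ = librosa.load(f"soundbank/{name}.wav", sr=SR)
    return y

# "Lyrics" as a sequence of (mora, semitone shift) pairs: the shift moves
# each syllable up or down so it sits on the melody's note.
melody = [("so", 0), ("ra", 2), ("ri", 4), ("a", 2)]

phrase = []
for mora, semitones in melody:
    clip = load_mora(mora)
    shifted = librosa.effects.pitch_shift(clip, sr=SR, n_steps=semitones)
    phrase.append(shifted)

# Concatenate the shifted syllables into one sung phrase and save it.
sf.write("phrase.wav", np.concatenate(phrase), SR)
```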
This is all to say that in the future, UDL guidelines (specifically Action and Expression) might extend to include Vocaloids as an option to substitute for voiceovers. For people who are wary of using their own voice, whether due to anxiety, a speech impediment, a heavy accent, or otherwise, being able to craft a speech and control its tone and sound is a more personal way to deliver the written word than our current text-to-speech models offer.