Shreya Gopal:
Hello and welcome to this episode of A to Z of Tech. We are returning to the alphabet and the letter L in this episode after our special session last time. I am Shreya, and I am joined by my co-host Louise.
Louise Taggart:
Hello.
Shreya:
In today's episode, we are discussing how we interact or talk to computers or machines, and how we can improve the way we communicate with each other using technology.
Louise:
I'm really excited about this episode. We're going to be learning about some of the technology that's helping us and computers understand each other, some of the challenges involved in that and also thinking a little bit about what the future might bring to us. In this episode of the pod we are joined virtually by our two guests, Natalia Danilova, who is a forensic data discovery specialist here at PwC; and also Juliet Gauthier, who is a strategic product manager at Red Bee Media. Thank you both so much for joining us today on the podcast.
Juliet Gauthier:
Hello.
Natalia Danilova:
Hi.
Louise:
Thank you both. Natalia, if we start our discussion with you. Could you tell us a little bit about your background before you joined PwC and also what your role is here now?
Natalia:
Sure, thanks, Louise. Before joining PwC, I explored the topic of information search at a scientific level. I have a PhD in information engineering, and for that research I was looking at the combination of various techniques for discovery of unknown unknowns on the web. If you remember the words of Donald Rumsfeld, 'there are known knowns - things that we know that we know; there are known unknowns - things that we know we don't know; but there are also unknown unknowns - things that we don't know we don't know', and these are of the most interest when you want to expand your knowledge of a particular subject.
As you mentioned, I work in the PwC forensic services team, and I specialise in data discovery. My role has evolved: it started with supporting investigations, analysing people's emails and documents to find evidence of a fraud, and over the last few years I have been focusing more on helping companies know their data.
Louise:
How does the work that you do now compare to what traditional information gathering might have looked like, maybe, a couple of decades ago?
Natalia:
This is an interesting question. Obviously, we live in a world of information, and there is more and more information available to us every day, but sometimes there can be too much of it, and it becomes a difficult task to quickly find what you are looking for. We wish we could just ask a computer a question as we would ask an expert to help us find an answer. The idea behind most search techniques is a simple keyword search; this is something that was used before and it's still quite a popular search technique now.
Imagine you browse an online shop and you want to find a specific product. You know its name, maybe you know its model number; you just type that into the search box and you get the results. What happens in the background is that the search system tries to find a direct match for your words and phrases, or query, and gives you the results back. You can get more accurate results by specifying some further parameters like a colour or a size - a very straightforward, direct match of values.
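To make the direct-match idea concrete, a keyword search with filter parameters might look something like this minimal Python sketch; the product catalogue and field names are invented for illustration, not any particular shop's system.

# Minimal sketch of direct keyword matching with extra filter
# parameters (colour, size). The catalogue and field names are
# invented for illustration; a real shop search is far more complex.

products = [
    {"name": "Trail Runner 2000", "model": "TR-2000", "colour": "blue", "size": 9},
    {"name": "Trail Runner 2000", "model": "TR-2000", "colour": "red", "size": 8},
    {"name": "City Walker", "model": "CW-10", "colour": "blue", "size": 9},
]

def keyword_search(query, colour=None, size=None):
    """Return products whose name or model number contains every query word,
    optionally narrowed by exact filter values."""
    terms = query.lower().split()
    results = []
    for p in products:
        haystack = (p["name"] + " " + p["model"]).lower()
        if all(term in haystack for term in terms):   # direct word match
            if colour and p["colour"] != colour:      # straightforward value filters
                continue
            if size and p["size"] != size:
                continue
            results.append(p)
    return results

print(keyword_search("trail runner", colour="blue"))  # finds the blue TR-2000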
Then if you look at how a web search engine works, it is a bit more sophisticated an algorithm: it starts to look at things like synonyms or stemming, where you don't necessarily look for an exact match, but the search engine tries to predict what it is you're searching for. The popular web search engines also rely on popular data sources to improve your search results, so basically they are trying to find matching variations of your search query.
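As a rough illustration of stemming and synonym expansion, the sketch below uses NLTK's Porter stemmer together with a tiny hand-made synonym table; the synonym table is invented for illustration, and real engines draw on far richer data.

# Sketch of query expansion with stemming and synonyms.
# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
synonyms = {"trainers": ["sneakers"], "cheap": ["budget", "affordable"]}  # toy table

def expand_query(query):
    """Expand a query into stems plus known synonyms, so 'running trainers'
    can also match documents that say 'run' or 'sneakers'."""
    expanded = set()
    for word in query.lower().split():
        expanded.add(stemmer.stem(word))              # e.g. running -> run
        for syn in synonyms.get(word, []):
            expanded.add(stemmer.stem(syn))
    return expanded

print(expand_query("cheap running trainers"))
# e.g. {'cheap', 'budget', 'afford', 'run', 'trainer', 'sneaker'}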
Where it becomes more interesting is when we try to make the machine actually understand the meaning of what we're trying to ask, and to match it against its understanding of the content of the documents or files that we ask it to search across. Here there are two techniques: the first mathematical, and the second semantic. Both are aimed at creating a conceptual model of the information pool available for searching, but they are different in nature. If we look at semantic models, one example is those based on ontologies.
What is an ontology? It's basically a knowledge model that tries to describe a knowledge domain by defining the actual relationships between things. As an example, medical professionals use ontology-based models to help represent knowledge about symptoms. They use those models to type in, in normal human language, what symptoms the patient has, and the system looks across that well-defined knowledge base to try to bring back potential diagnoses. The second technique is based on mathematics, and here we try to train the system to understand the concepts or ideas expressed within the content.
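The mathematical technique can be illustrated with a simple vector space model: documents and the query are turned into numeric vectors and ranked by similarity. The sketch below uses scikit-learn's TF-IDF vectoriser as a basic stand-in for the more sophisticated trained models Natalia refers to; the example documents are invented.

# Minimal sketch of a mathematical (vector space) model: documents and the
# query become vectors, and we rank documents by cosine similarity.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Invoice for consulting services rendered in March",
    "Minutes of the quarterly board meeting",
    "Payment transferred to the offshore supplier account",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

query_vector = vectorizer.transform(["money sent to supplier"])
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Print documents from most to least similar to the query.
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {doc}")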
Then the latest generation of search approaches is something that tries to rely on a sophisticated language model to bring machine and human interaction close to human-to-human interaction, in the form of a dialogue. For those who follow the news, there is a well-known artificial intelligence based model, and one of its use cases is that it helps translate human questions into a piece of code to search a different system for an answer. The previous generation used over 1 billion parameters, and the latest model utilises over 100 billion parameters, so it's a very sophisticated natural language based artificial intelligence model.
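The newer approach described here, where a language model turns a question into a piece of code that queries another system, could be sketched roughly as follows; llm_complete is a hypothetical placeholder rather than a real library call, and the table schema and canned response are invented.

# Hypothetical sketch of using a language model to translate a natural
# language question into a database query. llm_complete() is a placeholder,
# not a real API; substitute your model provider's client here.

def llm_complete(prompt: str) -> str:
    # Placeholder: in practice this would call a language model API.
    # A canned example response is returned so the sketch runs end to end.
    return ("SELECT supplier, SUM(amount) FROM payments "
            "GROUP BY supplier HAVING SUM(amount) > 10000;")

def question_to_sql(question: str, schema: str) -> str:
    prompt = (
        "Translate the question into a single SQL query.\n"
        f"Schema: {schema}\n"
        f"Question: {question}\n"
        "SQL:"
    )
    return llm_complete(prompt)

sql = question_to_sql(
    "Which suppliers were paid more than 10,000 pounds?",
    schema="payments(supplier, amount, paid_on)",
)
print(sql)  # the returned SQL would then be run against the system holding the data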
Louise:
Brilliant, thank you Natalia.
Natalia:
Thank you.
Shreya:
Thanks Louise. That was a really interesting discussion, and this leads us really nicely into our next guest Juliet. Hi, welcome. I know you do a lot of work around transcription and subtitling, can you tell us a bit more about your background and what you do?
Juliet:
Sure, hi Shreya.
Hi, I'm Juliet, I'm a strategic product manager at Red Bee Media, which is a global media services company. We work with big media brands and broadcasters all over the world - the BBC, Channel Four, SBS, Canal Plus International, and various content owners across Europe, the US, the Middle East, Asia Pacific, all over the place. One of the services that we provide is to do with accessibility of content, and that is really the area that I work in. Typically, that's referring to things like captions or subtitles for television, audio description, and sign language translation. This is a really interesting topic for me, because a lot of those services are about taking language and using machines, technology of some kind, to represent that language in a way that makes content more accessible for people.
Shreya:
Who is the target audience for something like this and what are some of the applications for this Juliet?
Juliet:
Typically, the target audience historically has been people with hearing impairments or visual impairments, as a way of allowing people access to media content. But we are increasingly seeing people using captions day to day, there's been a real shift, and particularly with captioning, with people in the younger generation who just like to have captions on TV. I have no idea what is driving that, maybe the idea that you can multitask a bit more. But yeah, I can talk a little bit about the process of captioning, and some of the applications that Natalia has talked about.
The first question is, what are captions? Really, it is taking audio and turning that audio into text that you can read on the screen. Right now, it's people who make those captions, and they use different types of technology to do it, but fundamentally you have an individual who sits and interacts with some piece of machinery or technology to produce captions. We have seen a real development over the years in how we use that technology to turn language into captions. More than 15 years ago, I would say, you were typically using people who were typing very quickly, often two people working together, each taking a sentence, really rapidly transcribing what they were hearing on a piece of live content, like news, for example.
Then around 2005, we really started to see the adoption of automatic speech recognition systems. By that I mean you had a person, maybe someone who had previously been typing very quickly, who sits in a booth with a microphone and some headphones on and repeats everything that they hear into a piece of voice recognition technology. At that phase of automatic speech recognition adoption, you really needed this trained operator almost as a kind of translator, listening to what was on TV and making sure it was pronounced clearly enough for a voice recognition system to understand.
We have talked a little bit about how humans and machines are interacting with language, and this is how it's done right now for the captions on TV: you need a person speaking really precisely, saying 'full stop' or 'comma' or 'question mark' to make sure that these things are represented in the text that goes on screen in the captions. It feels almost like hand holding to get the technology to a place where it can represent those captions accurately. But what we're starting to see now is automatic speech recognition that can do all of that a bit more independently.
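As a small illustration of that hand holding, the spoken punctuation from a respeaker has to be converted into symbols before the captions reach the screen; a toy version of that post-processing step might look like this.

# Toy post-processing step for respoken captions: convert spoken punctuation
# tokens ("full stop", "comma", "question mark") into symbols.

SPOKEN_PUNCTUATION = {
    "full stop": ".",
    "comma": ",",
    "question mark": "?",
}

def apply_spoken_punctuation(respoken_text: str) -> str:
    text = respoken_text
    for spoken, symbol in SPOKEN_PUNCTUATION.items():
        text = text.replace(" " + spoken, symbol)
    return text

print(apply_spoken_punctuation(
    "good evening comma and welcome to the six o'clock news full stop"
))
# -> "good evening, and welcome to the six o'clock news."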
I remember going to parties when I first started working in this industry and people would say, oh, what do you do, and I explained that captioning was done by people with microphones and they just sort of said, ‘well why can't you plug it into the TV, just get the audio coming straight through and send that through to the speech recognition software.’ At that point, it wasn't really advanced enough to be able to do that independently, but as I say right now, we are starting to see automatic speech recognition systems that can understand when to put a full stop in a sentence, and they understand when to put a comma or a question mark at the end of a sentence, and they are accurately transcribing things without having to have that person in the middle speaking really precisely almost like a robot to make sure that the text comes out accurately.
Shreya:
Yeah, I think Juliet you paint a really vivid image with the person typing frantically in a booth somewhere. It's really interesting to hear how technology has evolved in that space. Any new technology comes with its own challenges and limitations, what are some of the challenges you see in working with these technologies?
Juliet:
It is a great question. The thing I'm particularly interested in at the moment is this concept of, in quotation marks, 'AI racism'. One of the things that we have noticed in our work with automatic speech recognition is that the automation has to be able to recognise a wide range of voices and accents in order to provide good, accurate captions. In our experience, there are some accents that aren't as well represented in the models that we are using to take that process automatically from listening to the audio to putting text on the screen. What if an ASR engine hasn't been trained on a wide enough variety of accents? Like I said, we do see this. How do you handle a situation where a person listening to that audio can hear the word bath - and I say 'bath' because I'm in Blackpool - whether it's pronounced as 'bath', or with an Irish accent or an Indian accent, and they know that the word they need to put on the screen is 'bath'? If you don't have an ASR engine that has been trained on a really wide variety of data, then I suppose there's a risk that you're not going to get an accurate transcription. There are a lot of different accents on TV, media content is made by a huge range of people, and you want to make sure that if you are going to move to a fully automated system, the automation can transcribe someone with a London accent as flawlessly as someone with a Jamaican accent.
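One practical way to put numbers on this risk is to measure word error rate separately for each accent group in a test set and compare the results; the sketch below shows the idea, with invented sample transcripts.

# Sketch: compare speech recognition accuracy across accent groups by
# computing word error rate (WER) per group. The transcripts are invented.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

samples = [  # (accent group, human reference, ASR output) - invented examples
    ("London",  "i am going to run a bath", "i am going to run a bath"),
    ("Belfast", "i am going to run a bath", "i am going to run a bat"),
]

for accent, reference, asr_output in samples:
    print(accent, round(wer(reference, asr_output), 2))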
Louise:
Natalia, if we bring you back into the conversation and Juliet has touched on some of the challenges that she has seen in this type of technology application. Do you see any similar challenges when you're working on technology and processes that are applicable in a similar way?
Natalia:
If we talk about Juliet's field of expertise, sometimes for an investigation we have to analyse people's phone calls, and trying to find an effective way to analyse those in volume is a bit challenging. Accurate speech to text recognition really helps here, because instead of listening to the audio recordings and trying to find the relevant moments in different conversations that can help with an investigation, it's a lot quicker to work with the textual representation and then use all of those techniques that I mentioned.
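To make that concrete, once calls have been converted to text you can keep a timestamp with each transcript segment, so a keyword hit points straight back to the relevant moment in the recording; the sketch below shows the idea with invented call data.

# Sketch: keyword search over timestamped call transcripts, so a text hit
# points back to the moment in the recording. The segments are invented.

call_transcript = [
    {"start_seconds": 12.0, "text": "thanks for calling, how can I help"},
    {"start_seconds": 95.5, "text": "we can route the payment through the other account"},
    {"start_seconds": 210.3, "text": "send me the invoice by Friday"},
]

def find_moments(segments, keyword):
    """Return the timestamps and text of segments containing the keyword."""
    keyword = keyword.lower()
    return [
        (seg["start_seconds"], seg["text"])
        for seg in segments
        if keyword in seg["text"].lower()
    ]

for start, text in find_moments(call_transcript, "payment"):
    print(f"{start:>7.1f}s  {text}")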
Louise:
Juliet, from your perspective, as this type of technology develops, are you seeing future applications that we're not seeing yet?
Juliet:
You will start to see more automatic captioning delivered, particularly on television content, and fundamentally that is probably a good thing: it increases accessibility, it's much more scalable than the frantic people typing in the booth, trying to listen to what's going on on the TV, and you'll be able to see automatic speech recognition engines that can handle a greater variety of content types.
Louise:
How might technologies like this apply to something like sign language, for example, which is nonverbal, but a communication system in a slightly different way?
Juliet:
Yeah, it's really interesting seeing the developments and the ideas that are coming in for how you can expand sign language translation, in-vision sign language translation, on television. As you said, sign language is not really verbal communication; it's a combination of movement, and expression, and handshape, and all of these things together convey meaning. Interestingly, it's a language like any other language, and there are different dialects, which is always something that people find interesting - the idea that a sign used to denote a particular term in Scotland might be different from the sign used for the same term in Wales.
That said, the technology is advancing here, and one thing that we have seen is this idea of using avatars for sign language translation. Rather than needing to have a person who pops up at the bottom of the screen, often working from a studio somewhere, you can create machinery and technology that puts the movements that are so vital for the expression of sign language onto an avatar, which again expands the ability to have that kind of access to television. But one thing that is really important there is that those avatars are photorealistic and really accurate, because the movement and the shapes are so expressive. If you don't have an avatar that can represent all of that in really minute, fine detail, it's a much poorer quality of translation - almost like listening to someone speak with a really thick accent, I guess, would be the best way of explaining it. That's something that is being looked at at the moment, and I think it will probably be something that the technology does look into over the coming months or years. Yeah, I'm keeping an eye out on that one.
Louise:
That sounds like such a fascinating use case, and the points that you've made around accents and accent interpretation really resonated as well, as a fellow Northerner. Our spoken accents often aren't something that we think about objectively, they're just the way we communicate, so thinking about a technology that's trying to interpret that - yeah, it's really fascinating. Shreya, what are your thoughts on that piece?
Shreya:
The point you made about accents definitely resonates with me as well, and one of the more fascinating things is how words mean different things in different parts of the world; maybe that's a challenge we pose to our technologists in the future. On that note, I really just want to thank our guests Natalia and Juliet for joining us on this episode of the pod. One question I do have for you both: if our listeners want to learn more about the topics that you have covered today, where should they go? Juliet?
Juliet:
Yeah, there has been a very interesting paper published by the European Broadcasting Union, that came out on the 27th of May, so you can search online for that, if you look for ‘Freedom to look, freedom to listen, progress in media accessibility technology’. That covers quite a lot of the topics that I have talked about today and it’s just an interesting read.
Louise:
Thanks Juliet. Natalia, any recommendations from you?
Natalia:
Yeah, I would just suggest typing into any search engine 'how to search more effectively using a search engine', because it's amazing how little people use the search capabilities of web search engines. If you think about it, you can exclude stop words or irrelevant words, you can search for an exact phrase, or you can search for either one word or another. This is something the listeners would probably find useful in their day to day life, and they will spend less time trying to find information on the web.
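For example, most major web search engines support operators along these lines, though the exact syntax varies by engine:

"freedom to listen"              (quotation marks force an exact phrase match)
subtitles OR captions            (returns pages containing either word)
speech recognition -software     (the minus sign excludes a word)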
Shreya:
Wonderful, thank you Natalia.
Louise:
Natalia and Juliet, thank you both so much for joining us today. I think it was such an insightful discussion and certainly raised some points that I was in no way familiar with, and listeners thank you for joining us for this episode. If you enjoyed it, of course don't forget to like and subscribe to the podcast and you can also keep your eyes peeled for our next instalment, when we will be exploring the letter ‘M’. You can find me on Twitter, as always @loutagtech, and Shreya at @shreyagop, and we will see you next time.