Fine-tuning ASR

Fine-tuning a general-purpose automatic speech recognizer (ASR) for Dutch with medical consultations on the use of pharmaceuticals

I am Cristian Tejedor García, from Valladolid, Spain. I am pleased to introduce myself as one of the researchers in the HoMed project. I will explain how I expect my academic profile, background and experience to connect to my research activities within HoMed.

I received the B.Sc. and M.Sc. (Hons.) degrees in Computer Engineering and the Ph.D. (Cum Laude, international mention) in Computer Science from the University of Valladolid, Valladolid, Spain, in 2014, 2016, and 2020, respectively. I am currently a postdoctoral researcher in speech technology at the Centre for Language and Speech Technology (CLST, Radboud University Nijmegen, The Netherlands) and an honorary collaborator of the ECA-SIMM Research Group (University of Valladolid).


Research Interests

My research interests and experience span a wide range of topics in pronunciation training with speech technologies, from advanced software development with the Kaldi speech recognition environment to the analysis and design of pronunciation learning applications and human–computer interaction. These interests are directly related to my thesis, "Design and Evaluation of Mobile Computer-Assisted Pronunciation Training Tools for Second Language Learning".
During my Ph.D., my coursework covered a wide range of topics in Computer Science. My research in the computer-assisted pronunciation training area has given me the opportunity to understand automatic speech recognition (ASR), speech synthesis and human–computer interaction by means of empirical research with real users. Since 2015, I have published almost twenty papers in internationally reputable, indexed journals and conferences, such as IEEE Transactions on Learning Technologies (2020), IEEE Access (2020), INTERSPEECH (2016), EDULEARN (2020) and SLATE (2017), among others. My publications can be found at

During my first months in the HoMed project I have been finding my way around the general-purpose Dutch ASR system (Kaldi_NL) and other web services provided by CLST. Kaldi_NL is based on the Spoken Dutch Corpus (CGN), which contains Dutch recordings by speakers (aged 18–65) from all over the Netherlands and Flanders. The total amount of training material is 483 hours, spoken by 1185 female and 1678 male speakers. This system will be the baseline for the fine-tuned ASR for HoMed: we will improve the current models by adapting them to the health domain.

As you might know, ASR is a technology that converts speech to text in real time. Essentially, we can use ASR to identify and verify the identity of the speakers, and as a second step, we can use natural language understanding to determine what their intentions are. We could also combine ASR with other technologies, such as chatbots or text-to-speech, to hold a conversation with the speaker. Although real-time analysis of speech data is not trivial, in the near future we will be able to assist advisors with very accurate automated translation and emotion detection of the speech. You can find state-of-the-art ASR technology in virtual assistants at home or in dictation systems.

How ASR works

You might wonder how ASR works. Imagine you have a digitized speech signal, or several digital speech recordings on your computer. The first step is to place a time window of 25 or 30 milliseconds over the signal, calculate the speech features within that window, and then move the window forward by 10 milliseconds. You repeat this process over and over. In this way you process the digital speech in small, overlapping parts (each sample is covered by up to three windows) and calculate the features, which are then mapped to phonemes. Basically, an acoustic model estimates, for each window, the most probable phonemes. For instance, if you use Dutch acoustic models, trained with matching parameters on a set of phonemes and the correct text output, you obtain for each window an estimation of the most likely phonemes.
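The windowing step above can be sketched in a few lines. This is a minimal illustration, not the Kaldi implementation; `frame_signal` is a hypothetical helper, and the actual feature computation (e.g. MFCCs) would follow on each window.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slice a digitized speech signal into overlapping analysis windows.

    With a 25 ms window and a 10 ms hop, consecutive windows overlap,
    so each sample is covered by up to three windows.
    """
    win = int(sample_rate * win_ms / 1000)   # samples per window (400 at 16 kHz)
    hop = int(sample_rate * hop_ms / 1000)   # samples per hop (160 at 16 kHz)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

# One second of (placeholder) audio at 16 kHz yields 98 overlapping windows.
frames = frame_signal(np.zeros(16000))
print(frames.shape)  # (98, 400)
```

Each row of the result is one window, ready for feature extraction.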

(Figure: ASR fundamentals)

Language Model

The next important part of ASR is the language model. It predicts the most likely next word given the three or four preceding words, using a statistical model. So, depending on the context and on the language model, the ASR outcome can differ. There is a difference between the way we write and the way we speak, so ideally we need a lot of transcriptions of actual conversations and should build the language model from those transcriptions. However, collecting enough data costs a lot of money and work, so to help the system we are collecting audio/video material related to pharmaceuticals and adding its transcriptions to the material used to build the language model. We also need a lexicon, that is, a list of the unique words in the language model with their phonetic transcriptions.
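The idea of predicting the next word from the preceding context can be sketched with a toy n-gram model. For brevity this uses only one word of context (a bigram model), while real systems condition on three or four preceding words; the corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count word pairs to estimate P(word | previous word)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def most_likely_next(counts, prev):
    """Return the most frequently observed continuation after `prev`."""
    return counts[prev].most_common(1)[0][0]

# Toy corpus: the model learns which word tends to follow another.
corpus = ["de patiënt neemt de medicatie",
          "de patiënt neemt de tablet",
          "de arts neemt de tijd"]
model = train_bigram(corpus)
print(most_likely_next(model, "patiënt"))  # neemt
```

During recognition, such probabilities are used to prefer plausible word sequences over implausible ones.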

In a health environment we speak a little differently, since we use words related to medicines, doctors, illnesses, etc. So, a health language model needs to include these medical and pharmaceutical words. You can do that by adding to the language model a lot of text from transcriptions of books, newspapers or videos about health issues. In this way you turn your existing standard language model into a health language model (you could apply the same process to other areas, such as the Dutch Police or De Nederlandsche Bank). In summary, to create a health language model, you take texts spoken about a certain narrow topic (health) and add them to the standard language model.
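One common way to combine a standard model with domain text is linear interpolation: mix the probabilities of the general model and the domain model. The sketch below does this for simple unigram probabilities (real adaptation works on full n-gram models); the toy sentences and the weight `lam` are illustrative assumptions.

```python
from collections import Counter

def unigram_probs(text):
    """Relative word frequencies of a text."""
    words = text.split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def interpolate(general, domain, lam=0.5):
    """Mix two models: lam * P_domain + (1 - lam) * P_general."""
    vocab = set(general) | set(domain)
    return {w: lam * domain.get(w, 0.0) + (1 - lam) * general.get(w, 0.0)
            for w in vocab}

general = unigram_probs("de man loopt naar de winkel")
health = unigram_probs("de patiënt krijgt medicatie van de apotheker")
mixed = interpolate(general, health, lam=0.5)

# Domain words such as "medicatie" now have non-zero probability.
print(mixed["medicatie"] > 0)  # True
```

The weight controls how strongly the adapted model leans towards the health domain.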


The training data must be constantly updated. For example, you will not obtain good results on contemporary Dutch speech with an ASR system trained on speech data from the 1980s. ASR also struggles with the large variation in speech due to age, gender, speech impairment, race and accent. That is why you must fine-tune your ASR system to its specific context of application. In HoMed, over 200 hours of real-life recordings (video/audio) of medical visits by patients with either chronic obstructive pulmonary disease (COPD) or cancer, and of patient visits to pharmacists, will be added as training data. Part of this data will be transcribed at Nivel, since it is very sensitive.

All in all, you take your specific speech data as input, perform the feature extraction, estimate the most likely phonemes, and feed all that information into a search algorithm. This algorithm uses the language model and the lexicon, which define the words that can be recognized. The combination of (1) the lexicon, (2) the language model and (3) the phonetic representation gives you hypotheses of the text, and the best hypothesis is shown as output. However, when you examine a hypothesis you may notice errors: some words are missing from the lexicon or even from the language model. In that case you must retrain your models with this new information and process the same phrases again, hoping that those words will then be recognized as well as the others. So it is a constant cycle of updating your language model: you generate hypotheses, ask people to check the output, correct it where necessary, send the corrected text back to the lexicon and language model, and retrain the language model with the corrected text.
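The search step can be illustrated with a toy ranking of candidate transcriptions. A real decoder searches an enormous lattice of possibilities; this sketch, with invented probabilities, only shows how acoustic and language model scores combine to pick a hypothesis.

```python
import math

def decode(hypotheses):
    """Pick the hypothesis with the best combined score.

    Each hypothesis carries an acoustic probability (how well the phonemes
    match the audio) and a language model probability (how plausible the
    word sequence is). Log-probabilities are summed to combine them.
    """
    def score(hyp):
        text, p_acoustic, p_lm = hyp
        return math.log(p_acoustic) + math.log(p_lm)
    return max(hypotheses, key=score)[0]

candidates = [
    ("ik neem de medicatie", 0.30, 0.40),  # plausible words, decent acoustics
    ("ik neem de mediatie",  0.35, 0.05),  # closer acoustics, unlikely words
    ("ik neem de meditatie", 0.20, 0.20),
]
print(decode(candidates))  # ik neem de medicatie
```

Here the language model outvotes the slightly better acoustic match, which is exactly why domain adaptation of the language model matters.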

Performance of the results

In recent years, thanks to deep neural networks and artificial intelligence, we see that under optimal conditions ASR is almost as good as humans. For instance, if you record some utterances and ask people to provide verbatim transcriptions, there will be some mistakes: human accuracy is about 95% when transcribing speech data manually. A good ASR system's performance should be close to that percentage.

We can measure the performance of an ASR system with the Word Error Rate (WER) metric: the number of errors divided by the total number of words, that is, the number of substitutions, deletions and insertions divided by the total number of words in the reference. So, the higher the value, the worse the performance. If we have quiet conversations in a noise-free environment, by native speakers who speak in a grammatically correct way, the WER could be around 5%. The value will be worse for everyday conversations with some dialects and accents in a normal environment (with some background noise). Finally, for telephone conversations with dialects and heavy accents, by non-native speakers in a noisy environment, the performance will be the worst. So WER depends heavily on the quality of the input speech.
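WER is computed with the classic edit-distance dynamic programme over words. A minimal sketch (the example sentences are invented):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference: WER = 1/5 = 20%.
print(wer("de patiënt neemt de medicatie",
          "de patiënt neemt de meditatie"))  # 0.2
```

Note that WER can exceed 100% if the hypothesis contains many insertions.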


To conclude, in HoMed I am part of a multidisciplinary team with a lot of expertise and experience in ASR and data preparation, which means I will be able to develop new skills from different areas. I will work closely with the other postdoctoral researcher in the project, Berrie, a social scientist who is an expert in media studies and data visualization. I will be responsible not only for the language and acoustic modeling of the ASR work within the project, but I also intend to learn the basics of visualization techniques for the pharmaceutical data.

This is my first post for the blog and I hope you find it useful. I expect we can start obtaining results as soon as possible. Please do not hesitate to contact me if you have any questions or concerns.



Speech recognition may be useful during medication interviews

The article below is an (automatic) translation of the original Dutch article that can be read here.

Automatic speech recognition (ASR) is attracting increasing interest from the medical field and may also mean something for pharmacists. A survey has shown that pharmacists are particularly interested in its use in medication review meetings to reduce the administrative workload. However, most are still taking a wait-and-see attitude.


PW36 - 09-09-2022 | by John Davelaar, Berrie van der Molen, and Toine Pieters

Pharmacists have a wide and diverse range of tasks, which can differ enormously from those of other pharmacists, depending on the sector in which they work. However, one element is often the same: time pressure. New systems are regularly introduced to make work faster and more efficient, such as shortcut keys and user-friendly interfaces. Automatic speech recognition (ASR) may eventually be added to this.

We talked to ten pharmacists from different sectors (pharmaceutical industry, community pharmacy, hospital pharmacy and the research world) to see how they think ASR could be implemented in daily practice, what advantages and disadvantages are involved and what possible objections are foreseen.

ASR enables the conversion of spoken text into written, searchable text. The technology must be able to recognise the language spoken and be able to cope with different acoustic conditions [1]. An everyday example of this is the virtual smartphone assistant that you can ask questions to. The digital assistant converts spoken text, recognises your question and then searches for an answer. Even in some pharmacies, basic applications of speech recognition are already in place, for example when patients call the pharmacy, they first hear a menu with options and are connected to the right employee based on what they say. For application in specialist areas, such as healthcare, there are additional challenges: recognition of complex medical terms and specific acoustic settings such as consulting rooms and counter discussions.
Although the application of ASR in healthcare is not without its challenges, some doctors are already using it, for example during patient record-keeping, in order to reduce the administrative workload. In addition, the privacy sensitivity of medical calls must be taken into account. The HoMed research project is working on ASR of doctor-patient and pharmacist-patient conversations. We do this by using archived sound recordings of the Netherlands Institute for Health Research (Nivel).

Upon completion of the project, HoMed's ASR infrastructure will be made freely available for further applications in the Dutch healthcare domain. Reason enough to anticipate what ASR could mean for pharmaceutical care. In the questionnaire and interviews, we asked pharmacists how ASR could support pharmaceutical care and what the advantages and disadvantages might be. Our interviews revealed that pharmacists can mainly envision using ASR when conducting medication reviews. The advantage of this is that pharmacists can then concentrate fully on what the patient is saying and not on the typing, which means that fewer things are overlooked. Because the system eliminates the need for typing, the administrative workload may be reduced. "You are now talking and typing, but that does not always run smoothly," says a pharmacist. A disadvantage mentioned is that often only a small part of the medication review is relevant, while the programme records everything.
A pharmacist remarked: "I don't want to have to read through chunks of text to see what is important and what is not; too much is too much." This shows that there is not only a need for recognition of the spoken text, but also for more advanced applications such as recognition of terms, names or specific information in a conversation. In addition, ASR can be used to make many hours of conversation recordings searchable using keywords. As a result, important medical issues, such as the recognition of medication abuse and therapy non-adherence, can be investigated [2]. This can lead to pharmacists recognising such behaviour earlier and thus being able to respond in a timely and appropriate manner.

Possible objections

There is some reticence among pharmacists regarding the accuracy of automatic speech recognition. As indicated earlier, they foresee problems with the conversion of spoken words to text, as a result of which pharmacists expect to have to check the transcribed text for errors, which would take time. They expect that complex drug names in particular will lead to problems with speech recognition.

In the HoMed research project, this is one of the challenges we are currently working on: how can you ensure that the automatic speech recognition system is able to recognise as many pharmaceutical and medical terms as possible, especially as these are often pronounced in different ways? For this we make grateful use of lists of terms that have been made available to HoMed by the Medicines Evaluation Board (CBG), Health Base Foundation, the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH), the Institute for the Dutch Language (INT), the Institute for Responsible Medicine Use (IVM), KNMP, NHG, Nictiz and the Netherlands Care Institute (Zorginstituut Nederland).

There are also concerns that patients will be less free to speak out about embarrassing events and side effects if conversations are recorded for ASR purposes. The privacy challenges are part of HoMed, as this also concerns the practice material used to test the infrastructure, a necessary stage in the development of ASR for specific domains. The speech recogniser must not only be able to recognise medical text, but also be delivered with a method for privacy assurance.

The fact that several pharmacists are often in one room is mentioned as a potential problem for the effective deployment of ASR; are those acoustic conditions optimal? One pharmacist is concerned: "Are we all going to have to wear headphones so that the sound is well-captured and we are less affected by background noise? It's like we're going to become a call centre!"

The research within HoMed, in which we also investigate pharmacy conversations, will have to reveal this. The costs of the materials and the software itself are also mentioned: "What kind of price tag will this carry?" Will these costs yield a profit on the bottom line, pharmacists wonder. Many of the objections mentioned already have HoMed's attention, so solutions may come more quickly.

Food for thought

The tour of the pharmaceutical field for this article is an initial exploration of the possibilities and challenges of automatic speech recognition in pharmaceutical care. Pharmacists see the added value, but there are also reservations about its effectiveness. The aforementioned objections are currently being investigated in the development of ASR for the medical domain within HoMed, which is encouraging.

We expect to make the tool available through the Open Speech Technology Foundation in late 2023. Pharmacists who have their own ideas about specific applications in pharmaceutical practice can contact us.

HoMed: infrastructure for speech recognition

Homo Medicinalis (HoMed) is a collaborative project between Radboud University, Utrecht University, Nivel, Twente University and the Netherlands Institute for Sound and Vision, working on an infrastructure for automatic speech recognition of the Dutch medical domain. In the project, language and speech recognition technology is combined with media expertise and medical domain knowledge to meet this challenge.

More information on

John Davelaar is an honours student in pharmacy. Berrie van der Molen is a researcher at the Freudenthal Institute. Toine Pieters is professor at the Discipline Group Pharmacoepidemiology and Clinical Pharmacology, Department of Pharmaceutical Sciences at Utrecht University.


  1. Malik M, Malik MK, Mehmood K, Makhdoom I. Automatic speech recognition: a survey. Multimed Tools Appl. 2021;80(6):9411–57.
  2. Tejedor-García C, van der Molen B, van den Heuvel H, van Hessen A, Pieters T. Towards an Open-Source Dutch Speech Recognition System for the Healthcare Domain. In: Proceedings of the 13th Conference on Language Resources and Evaluation. Marseille: ELRA; 2022. p. 1032–9.
