DEMO: French Spoken Language Understanding with the new speech resources from NAVER LABS Europe


In this blog post we showcase the recent speech resources released by NAVER LABS Europe that will be presented at Interspeech 2024. The Speech-MASSIVE dataset is a multilingual spoken language understanding (SLU) dataset with rich metadata, and the mHuBERT-147 model is a compact and powerful speech foundation model that supports 147 languages with only 95M parameters. Here we present a simple cascaded SLU application for French that we built by leveraging both resources.

You can check out our demo on HuggingFace Spaces: https://huggingface.co/spaces/naver/French-SLU-DEMO-Interspeech2024


Table of Contents:

  1. About Speech-MASSIVE
  2. About mHuBERT-147
  3. Building an SLU application for French
  4. Meet us at Interspeech 2024!
  5. About NAVER LABS Europe speech research

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

SLU involves interpreting spoken input using Natural Language Processing (NLP). Voice assistants like Alexa and Siri are real-world examples of SLU applications. The core tasks in SLU include intent classification, which determines the goal or command behind an utterance, and slot-filling, which extracts specific details such as dates or music genres from the utterance. However, gathering SLU data has its challenges due to the complexities of recording, validation, and associated costs. Additionally, most existing datasets are primarily English-focused, limiting language diversity and cross-linguistic applications.

To address these gaps, we introduce Speech-MASSIVE, a multilingual SLU dataset that builds on the translations and annotations from the MASSIVE corpus. The dataset was curated through crowd-sourcing with strict quality controls and spans 12 languages (Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish and Vietnamese), covering 8 language families and 4 different scripts, with a total of over 83,000 spoken utterances.

Utilizing Speech-MASSIVE, we also established baseline models under various system and resource configurations to facilitate broad comparisons. You can read the details of the experiments in our paper. We hope this resource helps to advance multilingual SLU research and encourages the development of powerful end-to-end and cascaded models. The dataset is freely accessible on HuggingFace under the CC BY-NC-SA 4.0 license.

Here is what the Speech-MASSIVE data looks like. Our dataset is fully aligned across the 12 languages, which means it can also be used for automatic speech recognition (ASR) and speech translation (ST) in all of them.

| language | utt | audio | intent | slot | spkr_age | spkr_sex | ... | spkr_residence |
|---|---|---|---|---|---|---|---|---|
| French | allumer les lumières du disco | audio | iot_hue_lighton | allumer les [device_type : lumières] du [device_type : disco] | 56 | Male | ... | France |
| German | mach die discobeleuchtung an | audio | iot_hue_lighton | mach die [device_type : discobeleuchtung] an | 20 | Female | ... | Germany |
| Arabic | شغل لمبات الديسكو | audio | iot_hue_lighton | شغل [device_type : لمبات الديسكو] | 32 | Female | ... | United Kingdom |
| Spanish | pon las luces de discoteca | audio | iot_hue_lighton | pon las luces de discoteca | 23 | Male | ... | Mexico |
| Hungarian | kapcsold be a diszkófényeket | audio | iot_hue_lighton | kapcsold be a [device_type : diszkófényeket] | 23 | Female | ... | Hungary |
| Korean | 디스코 불 켜 | audio | iot_hue_lighton | [device_type : 디스코 불] 켜 | 34 | Male | ... | Australia |
| Dutch | zet de disco lichten aan | audio | iot_hue_lighton | zet de [device_type : disco lichten] aan | 23 | Female | ... | Netherlands |
| Polish | włącz światła dyskotekowe | audio | iot_hue_lighton | włącz [device_type : światła dyskotekowe] | 24 | Female | ... | Poland |
| Portuguese | ligar as luzes de discoteca | audio | iot_hue_lighton | ligar as [device_type : luzes de discoteca] | 22 | Male | ... | Portugal |
| Russian | включи диско освещение | audio | iot_hue_lighton | включи [device_type : диско освещение] | 32 | Female | ... | United Kingdom |
| Turkish | disko ışıklarını aç | audio | iot_hue_lighton | [device_type : disko ışıklarını] aç | 31 | Female | ... | United Kingdom |
| Vietnamese | mở các đèn disco lên | audio | iot_hue_lighton | mở [device_type : các đèn disco] lên | 33 | Female | ... | United States |

Below is an example of how to load the dataset:

from datasets import load_dataset, interleave_datasets, concatenate_datasets

# creating full train set by interleaving between German and French
speech_massive_de = load_dataset("FBK-MT/Speech-MASSIVE", "de-DE")
speech_massive_fr = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR")
speech_massive_train_de_fr = interleave_datasets([speech_massive_de['train'], speech_massive_fr['train']])

# creating train_115 few-shot set by concatenating Korean and Russian
speech_massive_ko = load_dataset("FBK-MT/Speech-MASSIVE", "ko-KR")
speech_massive_ru = load_dataset("FBK-MT/Speech-MASSIVE", "ru-RU")
speech_massive_train_115_ko_ru = concatenate_datasets([speech_massive_ko['train_115'], speech_massive_ru['train_115']])
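Each example pairs the audio with its transcript, slot/intent annotations, and speaker metadata. Below is a minimal sketch of how one might inspect a single sample; the field names used here (e.g. utt, annot_utt, intent_str) follow the MASSIVE schema and are assumptions on our part, so check the dataset card for the exact feature names.

# inspecting one French sample (field names assumed from the MASSIVE schema;
# see the Speech-MASSIVE dataset card for the exact features)
sample = speech_massive_fr["train"][0]
print(sample["utt"])                     # transcription
print(sample["annot_utt"])               # transcription with slot annotations
print(sample["intent_str"])              # intent label, e.g. iot_hue_lighton
print(sample["audio"]["sampling_rate"])  # the raw waveform lives under sample["audio"]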

mHuBERT-147: a compact multilingual HuBERT model

Speech representation models form the foundation of most modern speech-related technologies. These models are trained with self-supervised learning on vast amounts of unlabeled speech, where deep encoder networks learn to capture rich, nuanced speech patterns. Once trained, they can be applied to various speech applications, often achieving impressive results even with minimal labeled data.

mHuBERT-147 is a compact yet highly effective multilingual speech representation model that supports 147 languages. It achieves an exceptional balance between performance and efficiency, ranking 2nd and 1st in the ML-SUPERB (10min/1h) leaderboards, whilst being 3 to 10 times smaller than its competitors. Moreover, our model is trained on nearly five times less data than comparable multilingual models, highlighting the crucial role of data curation in efficient self-supervised learning.


You can explore the mHuBERT-147 model on HuggingFace, where it’s available under the CC BY-NC-SA 4.0 license.
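As a quick illustration of how the model can be used out of the box, here is a minimal sketch that extracts speech representations with the HuggingFace transformers library; the all-zeros waveform is just a placeholder for your own 16 kHz audio.

import numpy as np
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# load the pre-trained encoder and its feature extractor
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("utter-project/mHuBERT-147")
model = HubertModel.from_pretrained("utter-project/mHuBERT-147")

# one second of silence as a placeholder for a real 16 kHz waveform
waveform = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
print(hidden_states.shape)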

In the next section we will explain how to load mHuBERT-147 for ASR fine-tuning.

Building an SLU application for French

In this demo, we showcase a simple cascaded French SLU solution that leverages both Speech-MASSIVE and mHuBERT-147. We first build a French ASR model, using mHuBERT-147 as the backbone, that is smaller than Whisper and performs better on Speech-MASSIVE. We then feed our ASR predictions into an mT5 model fine-tuned for the Natural Language Understanding (NLU) tasks of slot-filling and intent classification. By chaining ASR and NLU, we obtain a simple cascaded SLU system.

1. Building a French ASR model using mHuBERT-147

We train a CTC-based ASR model using mHuBERT-147 as the backbone. Thanks to its compact size, mHuBERT-147 is highly efficient, making it a good choice for deployments where fast inference matters.

This French ASR model is available here: https://huggingface.co/naver/mHuBERT-147-ASR-fr

Training

We create the mHubertForCTC class, which is nearly identical to the existing HubertForCTC class. The key difference is that we add a few additional hidden layers at the end of the Transformer stack, just before the lm_head. We find that adding this extra capacity at the end of the encoder stack generally helps the model learn to produce characters in the target language(s) more efficiently.

from typing import Optional

import torch
import torch.nn as nn
from transformers import HubertConfig, HubertModel, HubertPreTrainedModel

class mHubertForCTC(HubertPreTrainedModel):
    def __init__(self, config, target_lang: Optional[str] = None):
        super().__init__(config)
        self.hubert = HubertModel(config)
        self.dropout = nn.Dropout(config.final_dropout)
        output_hidden_size = config.hidden_size
        self.has_interface = config.add_interface

        # NN layers on top of the trainable stack
        if config.add_interface:
            self.interface = nn.ModuleList([VanillaNN(output_hidden_size, output_hidden_size) for i in range(config.num_interface_layers)])
        self.lm_head = nn.Linear(output_hidden_size, config.vocab_size)
        self.post_init()

Our extra hidden layers are simple neural network blocks made of a linear projection followed by a ReLU activation.

class VanillaNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        """
        simple NN with ReLU activation (no norm)
        """
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)
        self.act_fn = nn.ReLU()

    def forward(self, hidden_states: torch.FloatTensor):
        hidden_states = self.linear(hidden_states)
        hidden_states = self.act_fn(hidden_states)
        return hidden_states
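For reference, here is a hedged sketch of what the forward pass of mHubertForCTC looks like conceptually (encoder, dropout, interface layers, then the lm_head); the actual implementation additionally handles attention masks and computes the CTC loss, as HubertForCTC does.

# conceptual sketch of mHubertForCTC.forward (not the exact implementation)
def forward(self, input_values, attention_mask=None):
    outputs = self.hubert(input_values, attention_mask=attention_mask)
    hidden_states = self.dropout(outputs[0])
    # pass through the extra hidden layers before projecting to characters
    if self.has_interface:
        for layer in self.interface:
            hidden_states = layer(hidden_states)
    logits = self.lm_head(hidden_states)  # CTC logits over the character vocabulary
    return logits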

To initialize an ASR model from the existing pre-trained mHuBERT-147 model, we first need to create a processor. You can learn more about processors and CTC fine-tuning in the existing HuggingFace tutorials.
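The CTC tokenizer below expects a character-level vocabulary file (your_vocab_file). If you do not already have one, here is a hedged sketch of how such a file is typically built from the training transcripts; the file name, transcripts and special tokens are illustrative, not the exact vocabulary used for mHuBERT-147-ASR-fr.

import json

# build a character-level vocabulary from the training transcripts
# (illustrative only; replace the list below with all of your transcripts)
texts = ["allume les lumières dans la cuisine"]
chars = sorted(set("".join(texts).replace(" ", "|")))  # "|" is the word delimiter
vocab = {c: i for i, c in enumerate(chars)}
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
vocab["[CTC]"] = len(vocab)  # extra token referenced later via config.ctc_token_id

with open("vocab.json", "w") as f:  # pass this path as your_vocab_file below
    json.dump(vocab, f)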

from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

tokenizer = Wav2Vec2CTCTokenizer(your_vocab_file, unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('utter-project/mHuBERT-147')
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

For our ASR training, we then extend the HubertConfig with the new parameters required by our mHubertForCTC class.

# load Hubert default config
config = HubertConfig.from_pretrained('utter-project/mHuBERT-147')
# add CTC-related tokens
config.pad_token_id = processor.tokenizer.pad_token_id
config.ctc_token_id = processor.tokenizer.convert_tokens_to_ids('[CTC]')
config.vocab_size = len(processor.tokenizer)

# add our extra hidden layers
config.add_interface = True
config.num_interface_layers = 3
# update the dropout parameter
config.final_dropout = 0.3

Once this is done, we can instantiate a new mHubertForCTC model. Note that we still need to train this model before using it!

model = mHubertForCTC.from_pretrained('utter-project/mHuBERT-147', config=config)

TIPS: In general, for fine-tuning mHuBERT-147, we recommend final_dropout > 0.1. If you are experiencing instabilities during training, consider training in fp32 instead.

Inference

Our ASR inference scripts are available at: https://huggingface.co/naver/mHuBERT-147-ASR-fr/tree/main/inference_code

Inference is as simple as this:

import librosa

from inference_code.run_inference import load_asr_model, run_asr_inference

# load the audio at 16 kHz, the sampling rate expected by the model
audio_struct = librosa.load(your_audio_file, sr=16000)
model, processor = load_asr_model()
prediction = run_asr_inference(model, processor, your_audio_file)

For this demo, we trained our ASR model using 123 hours of speech from three French datasets: FLEURS-102, CommonVoice v17.0 (downsampled), and Speech-MASSIVE. We use the Whisper normalization for training and evaluation. This small French ASR model outperforms whisper-large-v3 on the Speech-MASSIVE dev and test sets.

| Model | dev WER | dev CER | test WER | test CER |
|---|---|---|---|---|
| Whisper-large-v3 | 10.2 | 4.4 | 11.1 | 4.7 |
| mHuBERT-147-ASR-fr | 9.2 | 2.6 | 9.6 | 2.9 |

2. Leveraging mT5 for natural language understanding (NLU)

Now that we are able to transform French speech into text, the next step is to build an NLU model that gives us the intent behind an utterance. For that, we fine-tune the mT5 model in a sequence-to-sequence manner for the NLU task. In this setting, the source sequence is the N words of the utterance, and the target sequence is the N corresponding slot labels followed by the intent label (N+1 tokens).

Concretely, input data looks like the example below. As suggested in previous work, we also prepend “Annotate: ” to the source sequence in order to cue the mT5 model for the NLU task.

Source - Annotate: allume les lumières dans la cuisine
Target - Other Other Other Other Other house_place iot_hue_lighton
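For illustration, here is a minimal sketch of how such a source/target pair could be tokenized for mT5 sequence-to-sequence fine-tuning; the checkpoint name and preprocessing details are assumptions for the example, not the exact training recipe we used.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")  # illustrative checkpoint

source = "Annotate: allume les lumières dans la cuisine"
target = "Other Other Other Other Other house_place iot_hue_lighton"

model_inputs = tokenizer(source, max_length=128, truncation=True, return_tensors="pt")
model_inputs["labels"] = tokenizer(text_target=target, max_length=128, truncation=True, return_tensors="pt").input_ids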

For this demo, we use an mT5 model fine-tuned only on the French NLU data from Speech-MASSIVE, but it is easy to extend this to a multilingual setting. You can explore different NLU settings and models in our Speech-MASSIVE paper.

The French NLU model we use in our demo is available at: https://huggingface.co/Beomseok-LEE/NLU-Speech-MASSIVE-finetune

Here is how you load it:

from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

config = AutoConfig.from_pretrained("Beomseok-LEE/NLU-Speech-MASSIVE-finetune")
tokenizer = AutoTokenizer.from_pretrained("Beomseok-LEE/NLU-Speech-MASSIVE-finetune")
model = AutoModelForSeq2SeqLM.from_pretrained("Beomseok-LEE/NLU-Speech-MASSIVE-finetune")

Then, the NLU inference code is very simple. We prepend “Annotate: ” to the input string (the ASR output in our case) and pass it to the NLU model we fine-tuned.

  example = "Annotate: " + example
  input_values = tokenizer(example, max_length=128, padding=False, truncation=True, return_tensors="pt").input_ids

  with torch.no_grad():
    pred_ids = model.generate(input_values)
  prediction = tokenizer.decode(pred_ids[0], skip_special_tokens=True)
  splitted_pred = prediction.strip().split()

As our NLU model is trained in a sequence-to-sequence manner, the output will be made of the slot-filling tokens and the intent classification token, so we split the model output into the corresponding parts as follows:

slots_prediction = splitted_pred[:-1]
intent_prediction = splitted_pred[-1]
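When the ASR transcript and the predicted slot sequence have the same length, the slot labels can be aligned back to the words. A small illustrative sketch:

# align predicted slot labels with the input words
# (assumes the model predicted exactly one slot label per word)
words = example[len("Annotate: "):].split()
for word, slot in zip(words, slots_prediction):
    print(f"{word}\t{slot}")
print("intent:", intent_prediction)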

Now, we have all the components for our cascaded SLU system!
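Putting the two stages together, a minimal end-to-end sketch of the cascade looks like this; it reuses load_asr_model and run_asr_inference from the ASR inference code above and the tokenizer/model of the NLU checkpoint loaded earlier.

import torch
from inference_code.run_inference import load_asr_model, run_asr_inference

# 1) speech -> text
asr_model, asr_processor = load_asr_model()
transcript = run_asr_inference(asr_model, asr_processor, your_audio_file)

# 2) text -> slots + intent
input_ids = tokenizer("Annotate: " + transcript, return_tensors="pt").input_ids
with torch.no_grad():
    pred_ids = model.generate(input_ids)
pred = tokenizer.decode(pred_ids[0], skip_special_tokens=True).strip().split()
slots_prediction, intent_prediction = pred[:-1], pred[-1]
print("intent:", intent_prediction)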

3. Demo time!

Our demo is hosted on HuggingFace Spaces and available at: https://huggingface.co/spaces/naver/French-SLU-DEMO-Interspeech2024

If you speak French, try the microphone. If you don't, just click on the examples available in the demo page and have fun!


Note that the demo is only using CPU resources, so processing time may vary.

Meet us at Interspeech 2024

If you want to know more about our resources, or NAVER LABS Europe, don’t hesitate to look for us at Interspeech 2024. Both authors (Beomseok Lee and Marcely Zanon Boito) will be there all week!

Our presentations:

On Monday, September 2nd, from 4:00 to 4:20 pm:

  • "Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond", Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier
  • Location: Iasso Room
  • Paper: https://arxiv.org/abs/2408.03900

On Wednesday, September 4th, from 4:00 to 6:00 pm:

  • "mHuBERT-147: A Compact Multilingual HuBERT Model", Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, Ioan Calapodescu
  • Location: Poster Area 2A
  • Paper: https://arxiv.org/abs/2406.06371

About us:

The NAVER LABS Europe Interactive Systems group aims to equip robots to interact safely with humans, other robots and systems. Our research combines expertise in human-robot interaction, natural language processing, speech, information retrieval, data management and low code/no code programming. By leveraging multimodal data and models, we believe we can create more robust and user-friendly interfaces for robotic services. This work, centered on multi-modality, also encompasses multi-tasking and multilinguality.

Find out more at: https://europe.naverlabs.com/research/multimodal-nlp-for-hri/

This blog post was written by Beomseok Lee and Marcely Zanon Boito. We thank Laurent Besacier and Ioan Calapodescu for reviewing its content.

Acknowledgments:

This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality) funded by European Union’s Horizon Europe Research and Innovation programme under grant agreement number 101070631. For more information please visit https://he-utter.eu/