Commit 7447f9b by BenjaminGalliot (1 parent: 88456c4)

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +188 -0
  2. pangloss.py +200 -0
  3. test.py +3 -0
README.md ADDED
@@ -0,0 +1,188 @@
+ ---
+ pretty_name: Pangloss
+ annotations_creators:
+   - expert-generated
+ language_creators:
+   - expert-generated
+ language:
+   - jya
+   - nru
+ language_bcp47:
+   - x-japh1234
+   - x-yong1288
+ language_details: jya consists of japh1234 (Glottolog code); nru consists of yong1288 (Glottolog code)
+ license: cc-by-nc-sa-4.0
+ multilinguality:
+   - multilingual
+   - translation
+ size_categories:
+   yong1288:
+     - "10K<n<100K"
+   japh1234:
+     - "10K<n<100K"
+ source_datasets:
+   - original
+ task_categories:
+   - automatic-speech-recognition
+ task_ids:
+   - speech-recognition
+
+
+ configs:
+   - config_name: yong1288
+     data_files:
+       - split: train
+         path: "yong1288/train.csv"
+       - split: test
+         path: "yong1288/test.csv"
+       - split: validation
+         path: "yong1288/validation.csv"
+   - config_name: japh1234
+     data_files:
+       - split: train
+         path: "japh1234/train.csv"
+       - split: test
+         path: "japh1234/test.csv"
+       - split: validation
+         path: "japh1234/validation.csv"
+ ---
+
+ # Dataset Card for [Needs More Information]
+
+ ## Table of Contents
+ - [Dataset Description](#dataset-description)
+   - [Dataset Summary](#dataset-summary)
+   - [Supported Tasks](#supported-tasks-and-leaderboards)
+   - [Languages](#languages)
+ - [Dataset Structure](#dataset-structure)
+   - [Data Instances](#data-instances)
+   - [Data Fields](#data-fields)
+   - [Data Splits](#data-splits)
+ - [Dataset Creation](#dataset-creation)
+   - [Curation Rationale](#curation-rationale)
+   - [Source Data](#source-data)
+   - [Annotations](#annotations)
+   - [Personal and Sensitive Information](#personal-and-sensitive-information)
+ - [Considerations for Using the Data](#considerations-for-using-the-data)
+   - [Social Impact of Dataset](#social-impact-of-dataset)
+   - [Discussion of Biases](#discussion-of-biases)
+   - [Other Known Limitations](#other-known-limitations)
+ - [Additional Information](#additional-information)
+   - [Dataset Curators](#dataset-curators)
+   - [Licensing Information](#licensing-information)
+   - [Citation Information](#citation-information)
+
+ ## Dataset Description
+
+ - **Homepage:** [Web interface of the Pangloss Collection, which hosts the data sets](https://pangloss.cnrs.fr/)
+ - **Repository:** [GitHub repository of the Pangloss Collection, which hosts the data sets](https://github.com/CNRS-LACITO/Pangloss/)
+ - **Paper:** [A paper about the Pangloss Collection, including a presentation of the Document Type Definition](https://halshs.archives-ouvertes.fr/halshs-01003734)
+   [A paper in French about the deposit in Zenodo](https://halshs.archives-ouvertes.fr/halshs-03475436)
+ - **Leaderboard:** [Needs More Information]
+ - **Point of Contact:** [Benjamin Galliot](mailto:[email protected])
+
+ ### Dataset Summary
+
+ Two audio corpora of minority languages of China (Japhug and Na), with transcriptions, proposed as reference data sets for experiments in Natural Language Processing. The data, collected and transcribed in the course of immersion fieldwork, amount to a total of about 1,900 minutes in Japhug and 200 minutes in Na. By making them available in an easily accessible and usable form, we hope to facilitate the development and deployment of state-of-the-art NLP tools for the full range of human languages. There is an associated tool for assembling datasets from the Pangloss Collection (an open archive) in a way that ensures full reproducibility of experiments conducted on these data.
+ The Document Type Definition for the XML files is available here:
+ http://cocoon.huma-num.fr/schemas/Archive.dtd
+
+ ### Supported Tasks and Leaderboards
+
+ [Needs More Information]
+
+ ### Languages
+
+ Japhug (ISO 639-3 code: jya, Glottolog language code: japh1234) and Yongning Na (ISO 639-3 code: nru, Glottolog language code: yong1288) are two minority languages of China. The documents in the dataset have a transcription in the endangered language. Some of the documents have translations into French, English, and Chinese.
+
+ ## Dataset Structure
+
+ ### Data Instances
+
+ A typical data row includes the path, audio, sentence, document type and several translations (depending on the sub-corpus).
+
+ `
+ {
+     "path": "cocoon-db3cf0e1-30bb-3225-b012-019252bb4f4d_C1/Tone_BodyPartsOfAnimals_12_F4_2008_withEGG_069.wav",
+     "audio": "{'path': 'na/cocoon-db3cf0e1-30bb-3225-b012-019252bb4f4d_C1/Tone_BodyPartsOfAnimals_12_F4_2008_withEGG_069.wav', 'array': array([0.00018311, 0.00015259, 0.00021362, ..., 0.00030518, 0.00030518, 0.00054932], dtype=float32), 'sampling_rate': 16000}",
+     "sentence": "ʈʂʰɯ˧ | ɖɤ˧mi˧-ɬi˧pi˩ ɲi˩",
+     "doctype": "WORDLIST",
+     "translation:zh": "狐狸的耳朵",
+     "translation:fr": "oreilles de renard",
+     "translation:en": "fox's ears"
+ }
+ `
+
+ ### Data Fields
+
+ path: the path to the audio file;
+
+ audio: a dictionary containing the path to the audio file, the audio array and the sampling rate;
+
+ sentence: the sentence pronounced by the native speaker;
+
+ doctype: the document type (a text or a word list);
+
+ translation:XX: the translation of the sentence in the language XX.
+
+ ### Data Splits
+
+ The train, test and validation splits have all been reviewed. The sentences were split randomly (ratio 8:1:1) at sentence level, after extraction from the various source files.
+
+ ## Dataset Creation
+
+ ### Curation Rationale
+
+ [Needs More Information]
+
+ ### Source Data
+
+ #### Initial Data Collection and Normalization
+
+ [Needs More Information]
+
+ #### Who are the source language producers?
+
+ [Needs More Information]
+
+ ### Annotations
+
+ #### Annotation process
+
+ [Needs More Information]
+
+ #### Who are the annotators?
+
+ [Needs More Information]
+
+ ### Personal and Sensitive Information
+
+ [Needs More Information]
+
+ ## Considerations for Using the Data
+
+ ### Social Impact of Dataset
+
+ The dataset was collected in immersion fieldwork for language documentation. It contributes to the documentation and study of the world's languages by providing documents of connected, spontaneous speech recorded in their cultural context and transcribed in consultation with native speakers. The impact concerns both research and society at large: a guiding principle of the Pangloss Collection, which hosts the data sets, is that a close association between documentation and research is highly profitable to both. A range of possible uses exists, for the scientific and speaker communities and for the general public.
+
+ ### Discussion of Biases
+
+ The corpora are single-speaker and hence clearly do not reflect the sociolinguistic and dialectal diversity of the languages. No claim is made that the language variety described constitutes a 'standard'.
+
+ ### Other Known Limitations
+
+ The translations are entirely hand-made by experts working on these languages. The amount and type of translations available vary from document to document: not all documents have translations, and not all translated documents have the same translation languages (Chinese, French, English...).
+
+ ## Additional Information
+
+ ### Dataset Curators
+
+ [Needs More Information]
+
+ ### Licensing Information
+
+ [Needs More Information]
+
+ ### Citation Information
+
+ [Needs More Information]
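
The Data Splits section of the README above describes a random 8:1:1 split at sentence level. The following is only a minimal sketch of such a split, not the procedure actually used to build the CSV files: the helper name `split_rows`, the fixed seed and the integer rounding are all assumptions made for illustration.

```python
import random


def split_rows(rows, seed=0):
    """Shuffle rows and cut them into train/test/validation at a ratio of 8:1:1.

    Hypothetical helper: the seed and the rounding rule are illustrative
    assumptions, not taken from the commit.
    """
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_test = len(shuffled) // 10
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        # Whatever remains after train and test goes to validation.
        "validation": shuffled[n_train + n_test:],
    }


# Stand-in for 100 extracted sentences.
splits = split_rows(range(100))
```

Shuffling once and slicing guarantees that the three splits are disjoint and together cover every sentence.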
pangloss.py ADDED
@@ -0,0 +1,200 @@
+ # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """Pangloss datasets for Yongning Na (yong1288) and Japhug (japh1234)"""
+
+ import csv
+ import json
+ import os
+ import datasets
+ from datasets.tasks import AutomaticSpeechRecognition
+
+ _CITATION = {
+     "yong1288": """
+ @misc{michaud_alexis_2021_5336698,
+   author = {Michaud, Alexis and
+             Galliot, Benjamin and
+             Guillaume, Séverine},
+   title = {{Yongning Na for Natural Language Processing: a single-speaker audio corpus with transcriptions}},
+   month = aug,
+   year = 2021,
+   publisher = {Zenodo},
+   version = {1.0},
+   doi = {10.5281/zenodo.5336698},
+   url = {https://doi.org/10.5281/zenodo.5336698}
+ }
+ """,
+     "japh1234": """\
+ @misc{jacques_guillaume_2021_5521112,
+   author = {Jacques, Guillaume and
+             Galliot, Benjamin and
+             Guillaume, Séverine},
+   title = {{Japhug for Natural Language Processing: a single-speaker audio corpus with transcriptions}},
+   month = sep,
+   year = 2021,
+   publisher = {Zenodo},
+   version = {1.0},
+   doi = {10.5281/zenodo.5521112},
+   url = {https://doi.org/10.5281/zenodo.5521112}
+ }
+ """
+ }
+
+ _DESCRIPTION = """\
+ These datasets are extracts from the Pangloss collection and have
+ been preprocessed for ASR experiments in Na and Japhug.
+ """
+
+ _HOMEPAGE = "https://pangloss.cnrs.fr/"
+
+ _LICENSE = "https://creativecommons.org/licenses/by-nc-sa/4.0/fr/legalcode"
+
+ # The HuggingFace Datasets library doesn't host the datasets but only points to the original files.
+ # This can be an arbitrary nested dict/list of URLs (see below in `_split_generators` method)
+
+ _VERSION = datasets.Version("1.0.0")
+
+ _LANGUAGES = {
+     "yong1288": {
+         "url": "https://mycore.core-cloud.net/index.php/s/vaGMeRf4Iij8MWR/download",
+         "homepage": "https://zenodo.org/record/5336698",
+         "description": "Yongning Na dataset",
+         "translations": ["fr", "en", "zh"]
+     },
+     "japh1234": {
+         "url": "https://mycore.core-cloud.net/index.php/s/kuQCxmyVcUFWroV/download",
+         "homepage": "https://zenodo.org/record/5521112",
+         "description": "Japhug dataset",
+         "translations": ["fr", "zh"]
+     }
+ }
+
+ # TODO: Name of the dataset usually matches the script name with CamelCase instead of snake_case
+ class PanglossDataset(datasets.GeneratorBasedBuilder):
+     """The Pangloss datasets are extracts from the Pangloss Collection that can be used for ASR experiments in these languages."""
+     # Maps the French column names of the CSV files to the English feature names.
+     field_translations = {
+         "chemin_audio": "path",
+         "nature": "doctype",
+         "forme": "sentence",
+         "traduction:fr": "translation:fr",
+         "traduction:en": "translation:en",
+         "traduction:zh": "translation:zh"
+     }
+
+     # This is an example of a dataset with multiple configurations.
+     # If you don't want/need to define several sub-sets in your dataset,
+     # just remove the BUILDER_CONFIG_CLASS and the BUILDER_CONFIGS attributes.
+
+     # If you need to make complex sub-parts in the datasets with configurable options
+     # You can create your own builder configuration class to store attribute, inheriting from datasets.BuilderConfig
+     # BUILDER_CONFIG_CLASS = MyBuilderConfig
+
+     # You will be able to load one or the other configurations in the following list with
+     # data = datasets.load_dataset('my_dataset', 'first_domain')
+     # data = datasets.load_dataset('my_dataset', 'second_domain')
+     BUILDER_CONFIGS = [
+         datasets.BuilderConfig(name=language_name, version=_VERSION, description=language_data["description"])
+         for language_name, language_data in _LANGUAGES.items()
+     ]
+
+     #DEFAULT_CONFIG_NAME = "na"  # It's not mandatory to have a default configuration. Just use one if it makes sense.
+
+     def _info(self):
+         # TODO: This method specifies the datasets.DatasetInfo object which contains information and typings for the dataset
+         features = datasets.Features(
+             {
+                 "path": datasets.Value("string"),
+                 "audio": datasets.features.Audio(sampling_rate=16_000),
+                 "sentence": datasets.Value("string"),
+                 "doctype": datasets.Value("string"),
+                 **{f"translation:{language_code}": datasets.Value("string") for language_code in _LANGUAGES[self.config.name]["translations"]}
+             }
+         )
+
+         return datasets.DatasetInfo(
+             # This is the description that will appear on the datasets page.
+             description=_DESCRIPTION,
+             # This defines the different columns of the dataset and their types
+             features=features,  # Here we define them above because they are different between the two configurations
+             # If there's a common (input, target) tuple from the features, uncomment supervised_keys line below and
+             # specify them. They'll be used if as_supervised=True in builder.as_dataset.
+             # supervised_keys=("sentence", "label"),
+             # Homepage of the dataset for documentation
+             homepage=_HOMEPAGE,
+             # License for the dataset if available
+             license=_LICENSE,
+             # Citation for the dataset: _CITATION is a dict keyed by configuration name
+             citation=_CITATION[self.config.name],
+             # The transcription column must be a declared feature ("sentence"), not the CSV column name ("forme")
+             task_templates=[AutomaticSpeechRecognition(audio_column="audio", transcription_column="sentence")],
+         )
+
+     def _split_generators(self, dl_manager):
+         # TODO: This method is tasked with downloading/extracting the data and defining the splits depending on the configuration
+         # If several configurations are possible (listed in BUILDER_CONFIGS), the configuration selected by the user is in self.config.name
+
+         # dl_manager is a datasets.download.DownloadManager that can be used to download and extract URLS
+         # It can accept any type or nested list/dict and will give back the same structure with the url replaced with path to local files.
+         # By default the archives will be extracted and a path to a cached folder where they are extracted is returned instead of the archive
+         urls = _LANGUAGES[self.config.name]["url"]
+         data_dir = dl_manager.download_and_extract(urls)
+         return [
+             datasets.SplitGenerator(
+                 name=datasets.Split.TRAIN,
+                 # These kwargs will be passed to _generate_examples
+                 gen_kwargs={
+                     "filepath": os.path.join(data_dir, self.config.name, "train.csv"),
+                     "split": "train"
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name=datasets.Split.TEST,
+                 # These kwargs will be passed to _generate_examples
+                 gen_kwargs={
+                     "filepath": os.path.join(data_dir, self.config.name, "test.csv"),
+                     "split": "test"
+                 },
+             ),
+             datasets.SplitGenerator(
+                 name=datasets.Split.VALIDATION,
+                 # These kwargs will be passed to _generate_examples
+                 gen_kwargs={
+                     "filepath": os.path.join(data_dir, self.config.name, "validation.csv"),
+                     "split": "validation"
+                 },
+             ),
+         ]
+
+     # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
+     def _generate_examples(self, filepath, split):
+         # TODO: This method handles input defined in _split_generators to yield (key, example) tuples from the dataset.
+         # The `key` is for legacy reasons (tfds) and is not important in itself, but must be unique for each example.
+         with open(filepath, encoding="utf-8") as file_descriptor:
+             reader = csv.DictReader(file_descriptor)
+             for key, row in enumerate(reader):
+                 # Rename the French CSV columns by key, so that values stay aligned
+                 # with their names even if the CSV contains unmapped columns.
+                 data = {
+                     self.field_translations[fieldname]: value
+                     for fieldname, value in row.items()
+                     if fieldname in self.field_translations
+                 }
+                 # The audio path is resolved relative to the directory of the CSV file.
+                 data["audio"] = os.path.join(os.path.dirname(filepath), data["path"])
+                 # Yields examples as (key, example) tuples
+                 yield key, data
+
+
+ if __name__ == "__main__":
+     # for language in _LANGUAGES.keys():
+     datasets.load_dataset("pangloss.py", "japh1234")
+
+ # datasets-cli test datasets/pangloss --save_infos --all_configs
+ # datasets-cli dummy_data datasets/pangloss --auto_generate
+
test.py ADDED
@@ -0,0 +1,3 @@
+ from datasets import load_dataset
+
+ dataset = load_dataset("Lacito/pangloss", "japh1234")
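
The renaming of French CSV columns to English field names in `_generate_examples` can be tried without downloading anything. The following is a minimal sketch that reuses the `field_translations` mapping from `pangloss.py` and pairs names with values by key. The sample CSV content, the extra `identifiant` column and the helper `remap_row` are hypothetical and serve only to show that unmapped columns are dropped without shifting the remaining values, which pairing names and values by position would not guarantee.

```python
import csv
import io

# Same mapping as `field_translations` in pangloss.py.
FIELD_TRANSLATIONS = {
    "chemin_audio": "path",
    "nature": "doctype",
    "forme": "sentence",
    "traduction:fr": "translation:fr",
    "traduction:en": "translation:en",
    "traduction:zh": "translation:zh",
}


def remap_row(row):
    """Rename mapped columns by key and drop every unmapped column."""
    return {
        FIELD_TRANSLATIONS[name]: value
        for name, value in row.items()
        if name in FIELD_TRANSLATIONS
    }


# Hypothetical CSV content; `identifiant` is an invented unmapped column.
sample_csv = io.StringIO(
    "identifiant,chemin_audio,nature,forme,traduction:fr\n"
    "42,clips/a.wav,WORDLIST,example,exemple\n"
)
rows = [remap_row(row) for row in csv.DictReader(sample_csv)]
```

Each resulting row keeps only the four mapped fields, each with the value from its own column.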