Upload folder using huggingface_hub

Browse files

Files changed (3) hide show

README.md +188 -0
pangloss.py +200 -0
test.py +3 -0

README.md ADDED Viewed

	@@ -0,0 +1,188 @@

+---
+pretty_name: Pangloss
+annotations_creators:
+- expert-generated
+language_creators:
+- expert-generated
+language:
+- jya
+- nru
+language_bcp47:
+- x-japh1234
+- x-yong1288
+language_details: jya consists of japh1234 (Glottolog code); nru consists of yong1288 (Glottolog code)
+license: cc-by-nc-sa-4.0
+multilinguality:
+- multilingual
+- translation
+size_categories:
+  yong1288:
+  - "10K<n<100K"
+  japh1234:
+  - "10K<n<100K"
+source_datasets:
+- original
+task_categories:
+- automatic-speech-recognition
+task_ids:
+- speech-recognition
+configs:
+- config_name: yong1288
+  data_files:
+  - split: train
+    path: "yong1288/train.csv"
+  - split: test
+    path: "yong1288/test.csv"
+  - split: validation
+    path: "yong1288/validation.csv"
+- config_name: japh1234
+  data_files:
+  - split: train
+    path: "japh1234/train.csv"
+  - split: test
+    path: "japh1234/test.csv"
+  - split: validation
+    path: "japh1234/validation.csv"
+---
+# Dataset Card for [Needs More Information]
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-instances)
+  - [Data Splits](#data-instances)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+## Dataset Description
+- **Homepage:** [Web interface of the Pangloss Collection, which hosts the data sets](https://pangloss.cnrs.fr/)
+- **Repository:** [GithHub repository of the Pangloss Collection, which hosts the data sets](https://github.com/CNRS-LACITO/Pangloss/)
+- **Paper:** [A paper about the Pangloss Collection, including a presentation of the Document Type Definition](https://halshs.archives-ouvertes.fr/halshs-01003734)
+[A paper in French about the deposit in Zenodo](https://halshs.archives-ouvertes.fr/halshs-03475436)
+- **Leaderboard:** [Needs More Information]
+- **Point of Contact:** [Benjamin Galliot](mailto:[email protected])
+### Dataset Summary
+Two audio corpora of minority languages of China (Japhug and Na), with transcriptions, proposed as reference data sets for experiments in Natural Language Processing. The data, collected and transcribed in the course of immersion fieldwork, amount to a total of about 1,900 minutes in Japhug and 200 minutes in Na. By making them available in an easily accessible and usable form, we hope to facilitate the development and deployment of state-of-the-art NLP tools for the full range of human languages. There is an associated tool for assembling datasets from the Pangloss Collection (an open archive) in a way that ensures full reproducibility of experiments conducted on these data.
+The Document Type Definition for the XML files is available here:
+http://cocoon.huma-num.fr/schemas/Archive.dtd
+### Supported Tasks and Leaderboards
+[Needs More Information]
+### Languages
+Japhug (ISO 639-3 code: jya, Glottolog language code: japh1234) and Yongning Na (ISO 639-3 code: nru, Glottolog language code: yong1288) are two minority languages of China. The documents in the dataset have a transcription in the endangered language. Some of the documents have translations into French, English, and Chinese.
+## Dataset Structure
+### Data Instances
+A typical data row includes the path, audio, sentence, document type and several translations (depending on the sub-corpus).
+`
+{
+  "path": "cocoon-db3cf0e1-30bb-3225-b012-019252bb4f4d_C1/Tone_BodyPartsOfAnimals_12_F4_2008_withEGG_069.wav",
+  "audio": "{'path': 'na/cocoon-db3cf0e1-30bb-3225-b012-019252bb4f4d_C1/Tone_BodyPartsOfAnimals_12_F4_2008_withEGG_069.wav', 'array': array([0.00018311, 0.00015259, 0.00021362, ..., 0.00030518, 0.00030518, 0.00054932], dtype=float32), 'sampling_rate': 16000}",
+  "sentence": "ʈʂʰɯ˧ | ɖɤ˧mi˧-ɬi˧pi˩ ɲi˩",
+  "doctype": "WORDLIST",
+  "translation:zh": "狐狸的耳朵",
+  "translation:fr": "oreilles de renard",
+  "translation:en": "fox's ears",
+}
+`
+### Data Fields
+path: the path to the audio file;;
+audio: a dictionary containing the path to the audio file, the audio array and the sampling rate;
+sentence: the sentence the native has pronunced;
+doctype: the document type (a text or a word list);
+translation:XX: the translation of the sentence in the language XX.
+### Data Splits
+The train, test and validation splits have all been reviewed and were splitted randomly (ratio 8:1:1) at sentence level (after the extraction from various files).
+## Dataset Creation
+### Curation Rationale
+[Needs More Information]
+### Source Data
+#### Initial Data Collection and Normalization
+[Needs More Information]
+#### Who are the source language producers?
+[Needs More Information]
+### Annotations
+#### Annotation process
+[Needs More Information]
+#### Who are the annotators?
+[Needs More Information]
+### Personal and Sensitive Information
+[Needs More Information]
+## Considerations for Using the Data
+### Social Impact of Dataset
+The dataset was collected in immersion fieldwork for language documentation. It contributes to the documentation and study of the world's languages by providing documents of connected, spontaneous speech recorded in their cultural context and transcribed in consultation with native speakers. The impacts concern research, and society at large: a guiding principle of the Pangloss Collection, which hosts the data sets, is that a close association between documentation and research is highly profitable to both. A range of possibilities for uses exist, for the scientific and speaker communities and for the general public.
+### Discussion of Biases
+The corpora are single-speaker and hence clearly do not reflect the sociolinguistic and dialectal diversity of the languages. No claim is made that the language variety described constitutes a 'standard'.
+### Other Known Limitations
+The translations are entirely hand-made by experts working on these languages; the amount and type of translations available varies from document to document, as not all documents have translations and not all translated documents have the same translation languages (Chinese, French, English...).
+## Additional Information
+### Dataset Curators
+[Needs More Information]
+### Licensing Information
+[Needs More Information]
+### Citation Information
+[Needs More Information]

pangloss.py ADDED Viewed

	@@ -0,0 +1,200 @@

+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Pangloss datasets for Yongning Na (yong1288) and Japhug (japh1234)"""
+import csv
+import json
+import os
+import datasets
+from datasets.tasks import AutomaticSpeechRecognition
+_CITATION = {
+    "yong1288": """
+@misc{michaud_alexis_2021_5336698,
+author       = {Michaud, Alexis and
+                Galliot, Benjamin and
+                Guillaume, Séverine},
+title        = {{Yongning Na for Natural Language Processing: a
+                single-speaker audio corpus with transcriptions}},
+month        = aug,
+year         = 2021,
+publisher    = {Zenodo},
+version      = {1.0},
+doi          = {10.5281/zenodo.5336698},
+url          = {https://doi.org/10.5281/zenodo.5336698}
+    }
+    """,
+    "japh1234": """\
+@misc{jacques_guillaume_2021_5521112,
+author       = {Jacques, Guillaume and
+                Galliot, Benjamin and
+                Guillaume, Séverine},
+title        = {{Japhug for Natural Language Processing: a single-
+                speaker audio corpus with transcriptions}},
+month        = sep,
+year         = 2021,
+publisher    = {Zenodo},
+version      = {1.0},
+doi          = {10.5281/zenodo.5521112},
+url          = {https://doi.org/10.5281/zenodo.5521112}
+    }
+"""
+}
+_DESCRIPTION = """\
+These datasets are extracts from the Pangloss collection and have
+been preprocessed for ASR experiments in Na and Japhug.
+"""
+_HOMEPAGE = "https://pangloss.cnrs.fr/"
+_LICENSE = "https://creativecommons.org/licenses/by-nc-sa/4.0/fr/legalcode"
+# The HuggingFace Datasets library doesn't host the datasets but only points to the original files.
+# This can be an arbitrary nested dict/list of URLs (see below in `_split_generators` method)
+_VERSION = datasets.Version("1.0.0")
+_LANGUAGES = {
+    "yong1288": {
+        "url": "https://mycore.core-cloud.net/index.php/s/vaGMeRf4Iij8MWR/download",
+        "homepage": "https://zenodo.org/record/5336698",
+        "description": "Yongning Na dataset",
+        "translations": ["fr", "en", "zh"]
+    },
+    "japh1234": {
+        "url": "https://mycore.core-cloud.net/index.php/s/kuQCxmyVcUFWroV/download",
+        "homepage": "https://zenodo.org/record/5521112",
+        "description": "Japhug dataset",
+        "translations": ["fr", "zh"]
+    }
+}
+# TODO: Name of the dataset usually match the script name with CamelCase instead of snake_case
+class PanglossDataset(datasets.GeneratorBasedBuilder):
+    """The Pangloss datasets are extracts from Pangloss Collections that can be used for ASR experiments in these languages."""
+    field_translations = {
+        "chemin_audio": "path",
+        "nature": "doctype",
+        "forme": "sentence",
+        "traduction:fr": "translation:fr",
+        "traduction:en": "translation:en",
+        "traduction:zh": "translation:zh"
+    }
+    # This is an example of a dataset with multiple configurations.
+    # If you don't want/need to define several sub-sets in your dataset,
+    # just remove the BUILDER_CONFIG_CLASS and the BUILDER_CONFIGS attributes.
+    # If you need to make complex sub-parts in the datasets with configurable options
+    # You can create your own builder configuration class to store attribute, inheriting from datasets.BuilderConfig
+    # BUILDER_CONFIG_CLASS = MyBuilderConfig
+    # You will be able to load one or the other configurations in the following list with
+    # data = datasets.load_dataset('my_dataset', 'first_domain')
+    # data = datasets.load_dataset('my_dataset', 'second_domain')
+    BUILDER_CONFIGS = [
+        datasets.BuilderConfig(name=language_name, version=_VERSION, description=language_data["description"])
+        for language_name, language_data in _LANGUAGES.items()
+    ]
+    #DEFAULT_CONFIG_NAME = "na"  # It's not mandatory to have a default configuration. Just use one if it make sense.
+    def _info(self):
+        # TODO: This method specifies the datasets.DatasetInfo object which contains informations and typings for the dataset
+        features = datasets.Features(
+            {
+                "path": datasets.Value("string"),
+                "audio": datasets.features.Audio(sampling_rate=16_000),
+                "sentence": datasets.Value("string"),
+                "doctype": datasets.Value("string"),
+                **{f"translation:{language_code}": datasets.Value("string") for language_code in _LANGUAGES[self.config.name]["translations"]}
+            }
+        )
+        return datasets.DatasetInfo(
+            # This is the description that will appear on the datasets page.
+            description=_DESCRIPTION,
+            # This defines the different columns of the dataset and their types
+            features=features,  # Here we define them above because they are different between the two configurations
+            # If there's a common (input, target) tuple from the features, uncomment supervised_keys line below and
+            # specify them. They'll be used if as_supervised=True in builder.as_dataset.
+            # supervised_keys=("sentence", "label"),
+            # Homepage of the dataset for documentation
+            homepage=_HOMEPAGE,
+            # License for the dataset if available
+            license=_LICENSE,
+            # Citation for the dataset
+            citation=_CITATION,
+            task_templates=[AutomaticSpeechRecognition(audio_column="audio", transcription_column="forme")],
+        )
+    def _split_generators(self, dl_manager):
+        # TODO: This method is tasked with downloading/extracting the data and defining the splits depending on the configuration
+        # If several configurations are possible (listed in BUILDER_CONFIGS), the configuration selected by the user is in self.config.name
+        # dl_manager is a datasets.download.DownloadManager that can be used to download and extract URLS
+        # It can accept any type or nested list/dict and will give back the same structure with the url replaced with path to local files.
+        # By default the archives will be extracted and a path to a cached folder where they are extracted is returned instead of the archive
+        urls = _LANGUAGES[self.config.name]["url"]
+        data_dir = dl_manager.download_and_extract(urls)
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, self.config.name, "train.csv"),
+                    "split": "train"
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, self.config.name, "test.csv"),
+                    "split": "test"
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                # These kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, self.config.name, "validation.csv"),
+                    "split": "validation"
+                },
+            ),
+        ]
+    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
+    def _generate_examples(self, filepath, split):
+        # TODO: This method handles input defined in _split_generators to yield (key, example) tuples from the dataset.
+        # The `key` is for legacy reasons (tfds) and is not important in itself, but must be unique for each example.
+        with open(filepath, encoding="utf-8") as file_descriptor:
+            reader = csv.DictReader(file_descriptor)
+            for key, row in enumerate(reader):
+                translated_fieldnames = [self.field_translations[fieldname] for fieldname in reader.fieldnames if fieldname in self.field_translations.keys()]
+                data = dict(zip(translated_fieldnames, row.values()))
+                data["audio"] = os.path.join(os.path.dirname(filepath), data["path"])
+                # Yields examples as (key, example) tuples
+                yield key, data
+if __name__ == "__main__":
+    # for language in _LANGUAGES.keys():
+    datasets.load_dataset("pangloss.py", "japh1234")
+# datasets-cli test datasets/pangloss --save_infos --all_configs
+# datasets-cli dummy_data datasets/pangloss --auto_generate

test.py ADDED Viewed

	@@ -0,0 +1,3 @@


1	+ from datasets import load_dataset
2	+
3	+ dataset = load_dataset("Lacito/pangloss", "japh1234")