Chapter 2 Introduction

2.1 Open Definition

The Open Knowledge Foundation proposed an open definition by setting out principles that define “openness” in relation to data and content. It can be summed up in the following statement: “Open means anyone can freely access, use, modify, and share for any purpose”. The full text of the Open Definition is available online.

The most important elements of “openness” can be summarized as:

  • Availability and access
  • Reuse and redistribution
  • Universal participation

Interoperability remains a very important consideration in open data and content. It denotes the ability of different systems and organizations to work together (interoperate).

2.2 Open-source applications

Applications of the open-source movement can be found in various domains:

  • Software development: A decentralized software development model based on open collaboration. The source code of the software is publicly available and released under an open-source license. Examples: Mozilla Firefox, Hadoop, Linux, Git, Elasticsearch, Spark, Docker, MongoDB…
  • Government: Giving citizens access to government documents by developing and sharing open data platforms.
  • Knowledge and content: Knowledge that everyone is free to use, reuse, and redistribute without legal, social, or technological restriction. Openness ensures that content remains free for reuse, that source documents are available to interested persons, and that changes to content are accepted. Examples: Wikipedia, Project Gutenberg (a digital library of free e-books), Wikisource (an online digital library of free-content textual sources on a wiki, operated by the Wikimedia Foundation), OpenStreetMap…
  • Science: Making scientific research (publications, data, tools…) accessible to the public. Examples: The Dataverse Network Project, arXiv (an open-access repository of scientific papers)…
  • Education: A shared web space used as a platform to address learning, organizational, and management challenges. Examples: OpenClassrooms, Wikiversity…

2.3 Open data

2.3.1 Open data in institutional services

Open data refers to making data freely available for everyone to use and republish without restrictions. These data are provided in different ways, for example through APIs or as downloadable datasets and databases. The number of platforms and services sharing open data keeps increasing, across many domains and at different scales. Here are some examples of these services:

  • International organizations: The World Bank’s Open Data initiative provides all users with open access to World Bank data, the World Health Organization’s Global Health Observatory provides indicators and statistics through free databases, NASA provides Earth data…
  • Government data: Several governments have developed web platforms for exploring and accessing open data related to the economy, agriculture, society, culture… Examples: data.gouv.fr (France), data.gov (United States).
  • European organizations: The European Commission has invested in the European Union Open Data Portal and Eurostat.
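To illustrate what access “via APIs” can look like, here is a minimal Python sketch for the World Bank case. It is an illustration under stated assumptions, not an official client: we assume the Indicators API is served under `https://api.worldbank.org/v2`, that it returns a two-element JSON list (paging metadata, then records), and that each annual record carries a `date` and a `value` field. The indicator code `SP.POP.TOTL` (total population) is used only as an example, and the sample payload is made up so that the sketch runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the World Bank Indicators API.
BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, per_page=50):
    """Build the request URL for one indicator and one country."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator_payload(payload):
    """Turn the assumed [metadata, records] payload into (year, value) pairs."""
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []  # error messages and empty results carry no record list
    return [
        (int(rec["date"]), rec["value"])
        for rec in payload[1]
        if rec.get("value") is not None  # skip years without a reported value
    ]


def fetch_indicator(country, indicator):
    """Download and parse one annual indicator series (needs network access)."""
    with urlopen(build_indicator_url(country, indicator), timeout=30) as resp:
        return parse_indicator_payload(json.load(resp))


# Offline illustration with a made-up payload in the assumed shape.
sample = [
    {"page": 1, "pages": 1, "per_page": 50, "total": 2},
    [
        {"date": "2020", "value": 100.0},
        {"date": "2019", "value": None},
    ],
]
print(build_indicator_url("FR", "SP.POP.TOTL"))
print(parse_indicator_payload(sample))  # [(2020, 100.0)]
```

With network access, `fetch_indicator("FR", "SP.POP.TOTL")` would perform the actual request. Keeping URL construction and payload parsing in separate functions lets us check both without contacting the service.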

2.3.2 Open data for machine learning

The rapid development of machine learning models has contributed to an explosion of labeled data ready to use for training machine learning and deep learning models on different tasks: image classification, object detection, natural language processing (sentiment analysis, translation…), speech recognition… Various platforms provide access to these datasets, together with the descriptions and scripts used for developing and training existing models.

In the following sections, we list some well-known open datasets used for developing machine learning models.

2.3.2.1 Image data

| Task | Name | Description |
|---|---|---|
| Character recognition | MNIST | A dataset of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples. |
| Object detection and segmentation | COCO | A large-scale object detection, segmentation, and captioning dataset (330K images, 80 object categories). |
| Image classification, object detection | ImageNet | A large visual database designed for use in computer vision. It contains more than 15 million images hand-annotated to indicate the main objects present and their bounding boxes. |
| Image classification, object detection | Open Images dataset | A dataset of almost 9 million URLs for images. These images have been annotated with image-level labels and bounding boxes spanning thousands of classes. The dataset contains a training set of 9,011,219 images, a validation set of 41,260 images, and a test set of 125,436 images. |
| Visual question answering | VisualQA | VQA is a new dataset containing open-ended questions about images. Answering these questions requires an understanding of vision, language, and commonsense knowledge. |
| Object detection | The Street View House Numbers (SVHN) Dataset | More than 600,000 labeled images of house numbers obtained from Google Street View. |
| Image classification | CIFAR | CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. |
| Image classification | Fashion-MNIST | A dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes. |
| Face recognition | Faces in the Wild | 30,281 faces collected from news photographs. |
| Classification and object detection | SpaceNet | A corpus of commercial satellite imagery and labeled training data for machine learning research. |
| Video classification | YouTube-8M | A large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. |
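To make the MNIST and Fashion-MNIST descriptions more concrete, the following Python sketch reads the IDX binary layout in which, to our knowledge, both datasets are distributed: a 4-byte magic number (two zero bytes, a data-type code, the number of dimensions), one big-endian 32-bit size per dimension, then the raw pixel bytes. The function names `parse_idx` and `get_image` are hypothetical helpers written for this illustration, and the input is a tiny synthetic byte string rather than a real download.

```python
import struct


def parse_idx(raw):
    """Parse an IDX byte string into (dims, data) for unsigned-byte payloads."""
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian unsigned 32-bit integer per dimension follows the magic number.
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = raw[4 + 4 * ndim:]
    expected = 1
    for size in dims:
        expected *= size
    if len(data) != expected:
        raise ValueError("payload size does not match the header")
    return dims, data


def get_image(dims, data, index):
    """Return image `index` as a list of rows of pixel intensities (0-255)."""
    _, rows, cols = dims
    start = index * rows * cols
    return [
        list(data[start + r * cols:start + (r + 1) * cols])
        for r in range(rows)
    ]


# Synthetic stand-in: two 2x3 "images" instead of 60,000 images of 28x28 pixels.
header = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 3)
raw = header + bytes(range(12))
dims, data = parse_idx(raw)
print(dims)                      # (2, 2, 3)
print(get_image(dims, data, 1))  # [[6, 7, 8], [9, 10, 11]]
```

For a real training image file we would expect `dims` to be `(60000, 28, 28)`. The published files are usually gzip-compressed, so they would typically be read with `gzip.open(path, "rb").read()` before being passed to `parse_idx`.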

2.3.2.2 Text data

| Task | Name | Description |
|---|---|---|
| NLP | WordNet | A lexical database of semantic relations between words in more than 200 languages. WordNet links words through semantic relations including synonyms, hyponyms, and meronyms. |
| Sentiment analysis | IMDB dataset | A set of 25,000 highly polar movie reviews for training, and 25,000 for testing. |
| NLP | Yelp Reviews | Millions of user reviews, business attributes, and over 200,000 pictures from multiple metropolitan areas. |
| NLP | The Wikipedia Corpus | A collection of the full text of Wikipedia. It contains almost 1.9 billion words from more than 4 million articles. |
| Translation | Machine Translation of Various Languages | Training data for four European languages. The task is to improve current translation methods. |
| Question answering | SQuAD | The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. |
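The idea that a SQuAD answer is a span of the reading passage can be sketched in a few lines of Python. We assume here the nested JSON layout of the SQuAD 2.0 release (`data`, `paragraphs`, `context`, `qas`, `answers`, `answer_start`, `is_impossible`). The record below is made up for illustration, and `extract_answers` is a hypothetical helper, not part of any official tooling.

```python
def extract_answers(dataset):
    """Yield (question, answer span or None) for each question in a SQuAD-style dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Unanswerable questions have no span in the passage.
                if qa.get("is_impossible", False) or not qa["answers"]:
                    yield qa["question"], None
                    continue
                answer = qa["answers"][0]
                start = answer["answer_start"]
                # The answer is recovered by slicing the passage at the character offset.
                yield qa["question"], context[start:start + len(answer["text"])]


# Made-up record in the assumed layout (not taken from the real dataset).
context = "Project Gutenberg is a digital library of free e-books."
answer_text = "a digital library of free e-books"
sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": context,
            "qas": [
                {
                    "question": "What is Project Gutenberg?",
                    "is_impossible": False,
                    "answers": [{
                        "text": answer_text,
                        "answer_start": context.index(answer_text),
                    }],
                },
                {
                    "question": "Who founded it?",
                    "is_impossible": True,
                    "answers": [],
                },
            ],
        }],
    }]
}

results = list(extract_answers(sample))
print(results)
```

Storing a character offset next to the answer text means the span can always be checked against the passage, which is what makes the dataset suitable for extractive question answering.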

2.3.2.3 Audio/speech data

| Task | Name | Description |
|---|---|---|
| Speech recognition | Free Spoken Digit Dataset | A simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. |
| Music classification | Free Music Archive (FMA) | Full-length, high-quality audio, pre-computed features, and track-level and user-level metadata. |
| Music analysis | Million Song Dataset | A freely available collection of audio features and metadata for a million contemporary popular music tracks. |
| Speech recognition | LibriSpeech | A corpus of approximately 1,000 hours of read English speech sampled at 16 kHz. |
| Speech recognition | VoxCeleb | A large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. |
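Sampling rates such as 8 kHz or 16 kHz are stored in the header of each audio file, so we can check them before training a model. The sketch below uses Python's standard `wave` module. Since no real recording is bundled here, `make_tone` (a hypothetical helper) generates a short in-memory tone at 8 kHz as a stand-in for a file from one of these datasets.

```python
import io
import math
import struct
import wave


def describe_wav(source):
    """Return (sample_rate_hz, channels, duration_seconds) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def make_tone(rate=8000, seconds=0.5, freq=440.0):
    """Create an in-memory mono 16-bit WAV as a stand-in for a real recording."""
    n_frames = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(frames)
    buffer.seek(0)               # rewind so the buffer can be read back
    return buffer


rate, channels, duration = describe_wav(make_tone())
print(rate, channels, duration)  # 8000 1 0.5
```

The same `describe_wav` function accepts a file path, so it could be applied to a downloaded recording to confirm that its sampling rate matches the one stated for the dataset.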

2.4 References