CIFAR10
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
The CIFAR-10 Dataset
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
Here are the classes in the dataset:
- airplane
- automobile
- bird
- cat
- deer
- dog
- frog
- horse
- ship
- truck
Download
Python file: CIFAR-10 python (163 MB)
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a Python 2 routine that opens such a file and returns a dictionary:
def unpickle(file):
    import cPickle
    fo = open(file, 'rb')
    dict = cPickle.load(fo)
    fo.close()
    return dict
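The routine above relies on cPickle, which exists only in Python 2. As a minimal sketch for Python 3, the standard pickle module can read the same files if it is told how to decode Python 2 strings. The function name unpickle_py3 and its encoding parameter are assumptions made for this illustration, not part of the dataset's own documentation.

```python
import pickle


def unpickle_py3(path, encoding="latin1"):
    """Load a batch file that was pickled under Python 2.

    `unpickle_py3` is a hypothetical name used only for this sketch.
    encoding="latin1" decodes Python 2 strings to str, so keys such as
    'data' and 'labels' stay plain strings; encoding="bytes" would
    instead produce bytes keys such as b'data'.
    """
    with open(path, "rb") as fo:
        return pickle.load(fo, encoding=encoding)
```

A call such as unpickle_py3("cifar-10-batches-py/data_batch_1") should then return the same kind of dictionary as the Python 2 routine, with string keys when the default encoding is used.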
Loaded in this way, each of the batch files contains a dictionary with the following elements:
- data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
- labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
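The layout of one row of data can be made concrete with a little index arithmetic. The sketch below uses a synthetic row in which every entry holds its own index, so the output shows which positions are read for a given pixel. The helper name pixel_at and the constants SIDE and PLANE are hypothetical names chosen for this example.

```python
# Layout constants taken from the description above: 32x32 pixels, 3 colour planes.
SIDE = 32
PLANE = SIDE * SIDE  # 1024 values per colour channel


def pixel_at(row, y, x):
    """Return the (red, green, blue) values of pixel (y, x) from one 3072-value row."""
    offset = y * SIDE + x  # row-major position inside a single colour plane
    return (row[offset], row[PLANE + offset], row[2 * PLANE + offset])


# Synthetic stand-in for one row of `data`: each entry holds its own index.
fake_row = list(range(3 * PLANE))

print(pixel_at(fake_row, 0, 0))  # (0, 1024, 2048): first pixel of the first image row
print(pixel_at(fake_row, 1, 0))  # (32, 1056, 2080): first pixel of the second image row
```

With numpy, the same layout is usually expressed as row.reshape(3, 32, 32).transpose(1, 2, 0), which gives a 32x32x3 array with the colour channel last.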
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:
- label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.
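As a small illustration of how label_names is meant to be used, the sketch below turns numeric labels into class names. Here the list is written out by hand, assuming it follows the order of the class list given earlier (which agrees with the two entries quoted above); in practice it would come from unpickling batches.meta. The helper names_for and the sample labels are made up for the example.

```python
from collections import Counter

# Assumption: same order as the class list given earlier, which matches
# label_names[0] == "airplane" and label_names[1] == "automobile".
label_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]


def names_for(labels, label_names):
    """Translate numeric labels (0-9) into their class names."""
    return [label_names[i] for i in labels]


sample_labels = [0, 3, 3, 7, 9]  # made-up values for illustration only
sample_names = names_for(sample_labels, label_names)
print(sample_names)                          # ['airplane', 'cat', 'cat', 'horse', 'truck']
print(Counter(sample_names).most_common(1))  # [('cat', 2)]
```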
Download CIFAR-10 Files in Python
#!/usr/bin/env python
# Filename: load_cifar10.py
# Written for Python 2 (cPickle, urllib.urlretrieve, print statements).
from urllib import urlretrieve
import cPickle as pickle
import numpy as np
import tarfile
import os

# url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'


def unpickle(file):
    # Load one pickled batch file and return its dictionary.
    with open(file, 'rb') as fo:
        return pickle.load(fo)


def get_cifar10_data(file):
    data_dict = unpickle(file)
    X = data_dict['data']    # np.ndarray, shape (10000, 3072), dtype uint8
    y = data_dict['labels']  # list of 10000 ints in the range 0-9
    # Each row is stored channel by channel, so reshape to (N, channel, height, width).
    X = X.reshape(-1, 3, 32, 32).astype('float32')
    y = np.array(y).astype('int32')
    return X, y


def get_extract_path(url, filepath):
    # Download the archive to filepath and extract it, skipping steps already done.
    folder = os.path.split(filepath)[0]
    extract = 'cifar-10-batches-py'
    if not os.path.exists(folder):
        os.makedirs(folder)
    extract_path = os.path.join(folder, extract)
    if not os.path.exists(extract_path):
        if not os.path.exists(filepath):
            print "Downloading the CIFAR-10 file ...."
            urlretrieve(url, filepath)  # download the .tar.gz file
            print "Finished."
        TF = tarfile.open(filepath, 'r:gz')
        TF.extractall(folder)
        TF.close()
    return extract_path


def load_cifar10(url):
    filename = url.split('/')[-1]
    folder = 'data'
    path = get_extract_path(url, os.path.join(folder, filename))
    X, y = [], []
    # The training set is split across data_batch_1 ... data_batch_5.
    for i in range(1, 6):
        f = os.path.join(path, "data_batch_%d" % i)
        X_tmp, y_tmp = get_cifar10_data(f)
        X.append(X_tmp)
        y.append(y_tmp)
    Xtra = np.concatenate(X)
    ytra = np.concatenate(y)
    f_test = os.path.join(path, "test_batch")
    Xtest, ytest = get_cifar10_data(f_test)
    return Xtra, ytra, Xtest, ytest


if __name__ == '__main__':
    url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
    X_train, y_train, X_test, y_test = load_cifar10(url)
    print "X_train:", X_train.shape
    print "y_train:", y_train.shape
    print "X_test:", X_test.shape
    print "y_test:", y_test.shape
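The listing above uses Python 2 constructs (print statements, urllib.urlretrieve, cPickle). As a rough sketch of how the download-and-extract step might look under Python 3, the function below folds the same logic into one helper. The name fetch_and_extract and its default arguments are assumptions for this example; only the archive and folder names come from the listing.

```python
import os
import tarfile
from urllib.request import urlretrieve  # Python 3 location of urlretrieve


def fetch_and_extract(url, folder="data", extract="cifar-10-batches-py"):
    """Download the archive at `url` into `folder` and unpack it, skipping finished steps.

    `fetch_and_extract` is a hypothetical helper mirroring get_extract_path above.
    """
    os.makedirs(folder, exist_ok=True)
    archive_path = os.path.join(folder, url.split("/")[-1])
    extract_path = os.path.join(folder, extract)
    if not os.path.exists(extract_path):
        if not os.path.exists(archive_path):
            print("Downloading the CIFAR-10 file ...")
            urlretrieve(url, archive_path)
        # The context manager closes the archive even if extraction fails.
        with tarfile.open(archive_path, "r:gz") as tf:
            tf.extractall(folder)
    return extract_path
```

The returned path can then be joined with data_batch_1 through data_batch_5 and test_batch, as load_cifar10 does. Because both checks look at the file system first, running the helper a second time should neither download nor extract again.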