The LeRobot Dataset format

A dataset in LeRobotDataset format is very simple to use. It can be loaded from a repository on the Hugging Face hub or from a local folder with, e.g., dataset = LeRobotDataset("lerobot/aloha_static_coffee"), and it can be indexed like any Hugging Face or PyTorch dataset. For instance, dataset[0] will retrieve a single temporal frame from the dataset, containing observation(s) and an action as PyTorch tensors ready to be fed to a model.
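A minimal sketch of this usage (the import path lerobot.common.datasets.lerobot_dataset is an assumption; adjust it to your installed version):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load a dataset from the Hugging Face hub (cached locally on first use)
dataset = LeRobotDataset("lerobot/aloha_static_coffee")

# Index it like any PyTorch dataset: one temporal frame as a dict of tensors
frame = dataset[0]
print(frame.keys())           # observation and action keys, plus indexing metadata
print(frame["action"].shape)  # an action tensor ready to be fed to a model
```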

A distinctive feature of LeRobotDataset is that, rather than retrieving a single frame by its index, users can retrieve several frames based on their temporal relationship with the indexed frame, by setting delta_timestamps to a list of relative times with respect to the indexed frame. For example, with delta_timestamps = {"observation.image": [-1, -0.5, -0.2, 0]} one can retrieve, for a given index, 4 frames: 3 “previous” frames 1 second, 0.5 seconds, and 0.2 seconds before the indexed frame, and the indexed frame itself (corresponding to the 0 entry). See example 1_load_lerobot_dataset.py for more details on delta_timestamps.
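A sketch of this option (the camera key and frame shape depend on the dataset; observation.image is assumed here as the key used by, e.g., lerobot/pusht):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

delta_timestamps = {
    # 3 "previous" frames at -1.0 s, -0.5 s and -0.2 s, plus the indexed frame (0)
    "observation.image": [-1, -0.5, -0.2, 0],
}
dataset = LeRobotDataset("lerobot/pusht", delta_timestamps=delta_timestamps)

# The camera feature gains a leading time dimension: (4, c, h, w)
print(dataset[0]["observation.image"].shape)
```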

Under the hood, the LeRobotDataset format makes use of several ways to serialize data, which can be useful to understand if you plan to work more closely with this format. The LeRobot team tried to make a flexible yet simple dataset format that covers most types of features and specificities present in reinforcement learning and robotics, both in simulation and in the real world, with a focus on cameras and robot states, but easily extended to other types of sensory inputs as long as they can be represented by a tensor.

Here are the important details and the internal structure of a typical LeRobotDataset instantiated with dataset = LeRobotDataset("lerobot/aloha_static_coffee"). The exact features change from dataset to dataset, but the main aspects do not:

dataset attributes:
  ├ hf_dataset: a Hugging Face dataset (backed by Arrow/parquet). A typical set of features:
  │  ├ observation.images.cam_high (VideoFrame):
  │  │   VideoFrame = {'path': path to a mp4 video, 'timestamp' (float32): timestamp in the video}
  │  ├ observation.state (list of float32): positions of the arm joints (for instance)
  │  ... (more observations)
  │  ├ action (list of float32): goal positions of the arm joints (for instance)
  │  ├ episode_index (int64): index of the episode for this sample
  │  ├ frame_index (int64): index of the frame for this sample within its episode; starts at 0 for each episode
  │  ├ timestamp (float32): timestamp in the episode
  │  ├ next.done (bool): indicates the end of an episode; True for the last frame of each episode
  │  └ index (int64): general index in the whole dataset
  ├ episode_data_index: contains 2 tensors with the start and end indices of each episode
  │  ├ from (1D int64 tensor): first frame index for each episode — shape (num episodes,); starts at 0
  │  └ to (1D int64 tensor): last frame index for each episode — shape (num episodes,)
  ├ stats: a dictionary of statistics (max, mean, min, std) for each feature in the dataset, for instance:
  │  ├ observation.images.cam_high: {'max': tensor with the same number of dimensions as the data (e.g. `(c, 1, 1)` for images, `(c,)` for states), etc.}
  │  ...
  ├ info: a dictionary of metadata on the dataset
  │  ├ codebase_version (str): the codebase version the dataset was created with
  │  ├ fps (float): frames per second the dataset is recorded and synchronized to
  │  ├ video (bool): indicates if frames are encoded in mp4 video files to save space or stored as png files
  │  └ encoding (dict): when video is True, documents the main ffmpeg options used to encode the videos
  ├ videos_dir (Path): where the mp4 videos or png images are stored/accessed
  └ camera_keys (list of string): the keys to access camera features in the item returned by the dataset (e.g. `["observation.images.cam_high", ...]`)
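The sketch below shows how these attributes can be used in practice; it assumes the attribute names listed above and treats to as the last frame index of each episode, as documented (some versions may store an exclusive end instead):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/aloha_static_coffee")

print(dataset.camera_keys)  # e.g. ["observation.images.cam_high", ...]
print(dataset.info["fps"])  # frames per second the dataset was recorded at

# Collect the frames of episode 0 using episode_data_index
start = dataset.episode_data_index["from"][0].item()
end = dataset.episode_data_index["to"][0].item()  # last frame index of episode 0
episode = [dataset[i] for i in range(start, end + 1)]

# Normalize a state feature with the precomputed statistics
state = dataset[0]["observation.state"]
stats = dataset.stats["observation.state"]
normalized_state = (state - stats["mean"]) / stats["std"]
```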

A LeRobotDataset is serialized using several widespread file formats for each of its parts, namely:

  • hf_dataset is stored using the Hugging Face datasets library's serialization to parquet (see the snippet after this list)
  • videos are stored in mp4 format to save space
  • metadata are stored in plain json/jsonl files
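Since hf_dataset is a regular Hugging Face dataset, it can be inspected with the usual datasets API; a small illustrative sketch:

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/aloha_static_coffee")

# The underlying Arrow/parquet-backed dataset and its feature schema
print(dataset.hf_dataset)           # a datasets.Dataset with num_rows = total frames
print(dataset.hf_dataset.features)  # the per-frame features listed above
```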

Datasets can be uploaded to and downloaded from the Hugging Face hub seamlessly. To work on a local dataset, specify its location with the root argument if it is not in the default ~/.cache/huggingface/lerobot location.
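A sketch of loading from a custom location (the directory path below is a placeholder):

```python
from pathlib import Path
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset(
    "lerobot/aloha_static_coffee",
    root=Path("/path/to/my/datasets"),  # instead of ~/.cache/huggingface/lerobot
)
```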

How to visualize a LeRobotDataset using the Hugging Face Visualize Dataset space

Hugging Face provides a Visualize Dataset space that uses rerun.io to display the camera streams, robot states, and actions in a convenient way.

What’s next?

Next, record your own dataset and use it to train a policy!

For more information about LeRobot, check out the LeRobot GitHub repository.