First-Place Results on the GIFT-Eval Benchmark with TimeCopilot

This section documents the evaluation of a foundation model ensemble built using the TimeCopilot library on the GIFT-Eval benchmark.

With less than $30 in compute cost, TimeCopilot achieved first place in probabilistic accuracy (CRPS) among non-leaking models on this large-scale benchmark, which spans 24 datasets, 144k+ time series, and 177M data points.

TimeCopilot is an open‑source AI agent for time series forecasting. It provides a unified interface to multiple forecasting approaches, from foundation models to classical statistical, machine learning, and deep learning methods, along with built‑in ensembling for robust and explainable forecasting.


Description

This ensemble leverages TimeCopilot's MedianEnsemble feature, which combines three state-of-the-art foundation models (including Sundial, added as described in the changelog below) by taking the median of their forecasts.
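
The core idea behind a median ensemble is simple: every member model forecasts the same horizon, and the ensemble takes the element-wise median across models, which makes it robust to any single model going badly wrong. The snippet below is a minimal, library-agnostic sketch of that idea in pandas; it is not TimeCopilot's actual MedianEnsemble implementation, and the model names, values, and DataFrame layout are illustrative placeholders.

import pandas as pd

# Illustrative per-model forecasts for the same series and horizon.
# Model names and values are placeholders, not the actual ensemble members.
forecasts = pd.DataFrame(
    {
        "model_a": [102.0, 105.0, 110.0],
        "model_b": [100.0, 107.0, 108.0],
        "model_c": [101.0, 104.0, 112.0],
    },
    index=pd.date_range("2025-01-06", periods=3, freq="W"),
)

# A median ensemble is the element-wise median across the model columns.
median_forecast = forecasts.median(axis=1)
print(median_forecast)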

Setup

Prerequisites

  • Clone TimeCopilot's repo and go to experiments/gift-eval.
  • Python 3.11+
  • uv package manager
  • AWS CLI configured (for distributed evaluation)
  • Modal account (for distributed evaluation)

Installation

# Install dependencies
uv sync

Dataset Management

Download GIFT-Eval Dataset

# Download the complete GIFT-Eval dataset
make download-gift-eval-data

This downloads all 97 dataset configurations to ./data/gift-eval/.
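
As a quick sanity check after the download, listing the target directory confirms the data landed where the evaluation scripts expect it. The snippet below is just a convenience; the exact on-disk layout is whatever the make target produced under ./data/gift-eval/.

import pathlib

# List the dataset directories created by `make download-gift-eval-data`.
data_root = pathlib.Path("./data/gift-eval")
datasets = sorted(p.name for p in data_root.iterdir() if p.is_dir())
print(f"{len(datasets)} dataset directories found")
print(datasets[:10])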

Upload to S3 (Optional)

For distributed evaluation, upload the dataset to S3:

# Upload dataset to S3 for distributed access
make upload-data-to-s3
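
The make target wraps the actual sync; conceptually it just mirrors ./data/gift-eval/ into a bucket. For reference, a minimal boto3 sketch is shown below; the bucket name and key prefix are placeholders, not the values used in the experiment, and `make upload-data-to-s3` remains the supported path.

import pathlib

import boto3

# Hypothetical mirror of ./data/gift-eval/ into an S3 bucket.
BUCKET = "my-gift-eval-bucket"  # placeholder
PREFIX = "gift-eval"            # placeholder

s3 = boto3.client("s3")
data_root = pathlib.Path("./data/gift-eval")
for path in data_root.rglob("*"):
    if path.is_file():
        key = f"{PREFIX}/{path.relative_to(data_root)}"
        s3.upload_file(str(path), BUCKET, key)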

Evaluation Methods

1. Local Evaluation

Run evaluation on a single dataset locally:

uv run -m src.run_timecopilot \
  --dataset-name "m4_weekly" \
  --term "short" \
  --output-path "./results/timecopilot/" \
  --storage-path "./data/gift-eval"

Parameters:

  • --dataset-name: GIFT-Eval dataset name (e.g., "m4_weekly", "bizitobs_l2c/H")
  • --term: Forecasting horizon ("short", "medium", "long")
  • --output-path: Directory to save evaluation results
  • --storage-path: Path to GIFT-Eval dataset
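
The same module can also be driven programmatically to sweep several configurations on one machine. The wrapper below is a hypothetical convenience script, not part of the repo; it simply shells out to src.run_timecopilot with the flags documented above.

import subprocess

# Hypothetical local sweep over a few GIFT-Eval configurations.
configs = [
    ("m4_weekly", "short"),
    ("bizitobs_l2c/H", "short"),
]

for dataset_name, term in configs:
    subprocess.run(
        [
            "uv", "run", "-m", "src.run_timecopilot",
            "--dataset-name", dataset_name,
            "--term", term,
            "--output-path", "./results/timecopilot/",
            "--storage-path", "./data/gift-eval",
        ],
        check=True,
    )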

2. Distributed Evaluation (Modal)

Evaluate all 97 dataset configurations in parallel using Modal:

# Run distributed evaluation on Modal cloud
uv run modal run --detach -m src.run_modal::main

This creates one GPU job per dataset configuration, significantly reducing wall-clock evaluation time; a sketch of this fan-out pattern follows the infrastructure summary below.

Infrastructure:

  • GPU: A10G per job
  • CPU: 8 cores per job
  • Timeout: 3 hours per job
  • Storage: S3 bucket for data and results
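
For orientation, here is a minimal sketch of how a per-dataset fan-out of this kind looks with Modal's API, using the resources listed above. It is an illustrative approximation, not the repo's actual src/run_modal.py; the app name, image definition, function body, and configuration list are placeholders.

import modal

app = modal.App("gift-eval-sketch")  # placeholder app name

# Placeholder image; the real job installs the project's own dependencies.
image = modal.Image.debian_slim().pip_install("pandas")

@app.function(gpu="A10G", cpu=8.0, timeout=3 * 60 * 60, image=image)
def evaluate(config: str) -> str:
    # Placeholder body: the real job runs the TimeCopilot evaluation for
    # one GIFT-Eval dataset configuration and writes its results to S3.
    return f"evaluated {config}"

@app.local_entrypoint()
def main():
    # One job per dataset configuration, executed in parallel on Modal.
    configs = ["m4_weekly", "bizitobs_l2c/H"]  # placeholder for all 97 configs
    for result in evaluate.map(configs):
        print(result)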

3. Collect Results

Download and consolidate results from distributed evaluation:

# Download all results from S3 and create consolidated CSV
uv run python -m src.download_results

Results are saved to results/timecopilot/all_results.csv in GIFT-Eval format.
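
The consolidation step itself is plain CSV concatenation. If you ever need to rebuild the combined file from per-dataset CSVs you already have locally, a sketch like the one below works; the file locations are assumptions, and the actual src.download_results script additionally handles the S3 download and the GIFT-Eval output format.

import glob

import pandas as pd

# Hypothetical consolidation of per-dataset result CSVs into one file.
# The directory layout assumed here is illustrative.
csv_paths = sorted(glob.glob("./results/timecopilot/*/results.csv"))
all_results = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)
all_results.to_csv("./results/timecopilot/all_results.csv", index=False)
print(f"Consolidated {len(csv_paths)} files into all_results.csv")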

Changelog

  • 2025-08-05: GIFT‑Eval recently enhanced its evaluation dashboard with a new flag that identifies models likely affected by data leakage (i.e., having seen parts of the test set during training). While the test set itself hasn’t changed, this new insight helps us better interpret model performance. To keep our results focused on truly unseen data, we’ve excluded any flagged models from this experiment and added the Sundial model to the ensemble. The previous experiment details remain available here.