

FIFA World Cup Games

A full-match FIFA World Cup video metadata dataset with 40 YouTube matches linked to structured match statistics, lineups, event timelines, formations, venues, and prediction polls.

June 9, 2026 | Dataset | FIFA World Cup | Football | Sports | Long-Video Understanding | Multimodal

FIFA World Cup Games: A Full-Match Football Video Dataset

FIFA World Cup Games is an open-source dataset for long-form football video understanding. It provides 40 full-match YouTube references from classic FIFA World Cup games and links each match to structured football annotations, including match metadata, team statistics, lineups, formations, text event timelines, venue information, referees, attendance, and pre-match prediction polls.

Instead of redistributing video files, the dataset stores YouTube video IDs and URLs so users can independently retrieve videos when their use case, copyright constraints, YouTube’s Terms of Service, and local research policies allow it. This makes the dataset a structured bridge between full-match football broadcasts, match-level statistics, event narratives, and tournament context.

Why FIFA World Cup Games Exists

Full-match football broadcasts are long, tactical, and highly contextual. A single World Cup match can include more than 90 minutes of continuous play, substitutions, cards, goals, tactical shape changes, injury time, extra time, penalty shootouts, and shifting momentum.

FIFA World Cup Games narrows this challenge into a compact, structured dataset in which each retained match is linked to match-level evidence. This setting is useful because football broadcasts naturally contain rich multimodal signals.

By connecting full-match videos with structured football annotations, FIFA World Cup Games supports research in long-video reasoning, sports analytics, multimodal retrieval, temporal prediction, match question answering, and football agent systems.

Dataset Overview

FIFA World Cup Games includes a top-level meta.jsonl file and one folder for each structured match. Each match folder contains metadata, statistics, event timelines, lineups, and prediction polls.

Key Statistics

| Statistic | Value |
| --- | --- |
| Indexed YouTube video records | 40 |
| Structured match folders | 40 |
| World Cup editions covered | 7 editions, from 1998 France to 2022 Qatar |
| Unique national teams | 25 |
| Total event timeline rows | 3,543 |
| Event rows per match | 4 to 465, average 88.6 |
| Stages covered | Group stage, 8th finals, quarter-finals, semi-finals, final |
| Statistics fields observed | 22 team-level metric names |
| Video distribution | Video files are not included; YouTube IDs and URLs are provided |

Edition Coverage

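| Edition | Indexed videos | Structured matches |
| --- | --- | --- |
| 2022 Qatar | 8 | 8 |
| 2018 Russia | 10 | 10 |
| 2014 Brazil | 8 | 8 |
| 2010 South Africa | 3 | 3 |
| 2006 Germany | 5 | 5 |
| 2002 Korea/Japan | 4 | 4 |
| 1998 France | 2 | 2 |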

Stage Coverage

| Stage | Structured matches |
| --- | --- |
| Group stage | 20 |
| 8th finals | 8 |
| Quarter-finals | 5 |
| Semi-finals | 2 |
| Final | 5 |

meta.jsonl

meta.jsonl is the top-level video and source-link index. Each line describes one retained YouTube match.

| Field | Type | Description |
| --- | --- | --- |
| video_id | string | YouTube video ID |
| url | string | YouTube watch URL |
| title | string | Original or lightly cleaned YouTube video title |
| teams | list[string] | Parsed teams in the match |
| world_cup | string | World Cup edition, such as 2022 Qatar |
| match_info_url | string | Linked Soccer365 match page |

A representative metadata entry looks like this:

{
  "video_id": "HxBqMbI5kqQ",
  "url": "https://www.youtube.com/watch?v=HxBqMbI5kqQ",
  "title": "FULL MATCH: Brazil v Croatia | Quarter-Finals | FIFA WORLD CUP QATAR 2022",
  "teams": ["Brazil", "Croatia"],
  "world_cup": "2022 Qatar",
  "match_info_url": "https://soccer365.net/games/15292867/"
}
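Because meta.jsonl holds one JSON object per line, it can also be read with the standard library alone. The sketch below is a minimal illustration, not part of the dataset tooling: the helper names read_jsonl and index_by_video_id are hypothetical, and the demo writes the entry shown above to a temporary file so that it runs without the dataset.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def index_by_video_id(records):
    """Map each YouTube video ID to its meta.jsonl record."""
    return {record["video_id"]: record for record in records}


# The representative entry from this section, reused as demo input.
sample_record = {
    "video_id": "HxBqMbI5kqQ",
    "url": "https://www.youtube.com/watch?v=HxBqMbI5kqQ",
    "title": "FULL MATCH: Brazil v Croatia | Quarter-Finals | FIFA WORLD CUP QATAR 2022",
    "teams": ["Brazil", "Croatia"],
    "world_cup": "2022 Qatar",
    "match_info_url": "https://soccer365.net/games/15292867/",
}

with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "meta.jsonl"
    meta_path.write_text(json.dumps(sample_record) + "\n", encoding="utf-8")
    records = read_jsonl(meta_path)

by_id = index_by_video_id(records)
print(by_id["HxBqMbI5kqQ"]["teams"])
```

With the real file, pass the path to meta.jsonl in your local copy of the repository instead of the temporary file.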

Dataset Structure

Each structured match is organized under games/. Folder names use a stable human-readable convention based on edition year, tournament stage, and teams.

FIFA_World_Cup_Games/
├── meta.jsonl
├── images/
│   └── poster.jpeg
├── games/
│   ├── 2022-final-Argentina-vs-France/
│   │   ├── metadata.json
│   │   ├── stats.json
│   │   ├── events.json
│   │   ├── lineups.json
│   │   └── prediction.json
│   ├── 2018-group-stage-Germany-vs-Mexico/
│   │   ├── metadata.json
│   │   ├── stats.json
│   │   ├── events.json
│   │   ├── lineups.json
│   │   └── prediction.json
│   └── ...
└── poster_assets/
    └── thumbnails/

The folder naming convention is:

YYYY-stage-TeamA-vs-TeamB

For example:

2022-final-Argentina-vs-France
2014-semi-finals-Brazil-vs-Germany
2018-group-stage-Germany-vs-Mexico
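Since stage names can themselves contain hyphens, splitting a folder name on "-" alone is ambiguous. One possible approach, sketched below, is to match against a list of known stage slugs. The slugs "final", "semi-finals", and "group-stage" come from the examples above; "quarter-finals" and "8th-finals" are assumed spellings, and parse_folder_name is a hypothetical helper.

```python
# "final", "semi-finals", and "group-stage" appear in the documented examples;
# "quarter-finals" and "8th-finals" are assumed slug spellings.
KNOWN_STAGES = ("group-stage", "8th-finals", "quarter-finals", "semi-finals", "final")


def parse_folder_name(name):
    """Split a YYYY-stage-TeamA-vs-TeamB folder name into its parts."""
    year, _, rest = name.partition("-")
    if not (len(year) == 4 and year.isdigit()):
        raise ValueError(f"no 4-digit year at the start of {name!r}")

    for stage in KNOWN_STAGES:
        prefix = stage + "-"
        if rest.startswith(prefix):
            teams = rest[len(prefix):]
            break
    else:
        raise ValueError(f"unknown stage in {name!r}")

    team_a, separator, team_b = teams.partition("-vs-")
    if not separator or not team_a or not team_b:
        raise ValueError(f"expected TeamA-vs-TeamB in {name!r}")

    return {"year": int(year), "stage": stage, "team_a": team_a, "team_b": team_b}


print(parse_folder_name("2014-semi-finals-Brazil-vs-Germany"))
```

Matching the stage as a prefix keeps any hyphens inside team names intact, because everything after the stage slug is split only on "-vs-".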

The repository does not include downloaded video files. Users can use the video_id or url field to retrieve videos independently when permitted.

Match Metadata

Each metadata.json file links the source YouTube video to normalized match context.

| Field | Type | Description |
| --- | --- | --- |
| source_video.id | string | YouTube video ID |
| source_video.url | string | YouTube watch URL |
| source_video.title | string | Cleaned match title |
| source_video.date | string | Match date in YYYY-MM-DD format when available |
| match_info.home_team | string | Home/listed first team from the match source |
| match_info.away_team | string | Away/listed second team from the match source |
| match_info.stage | string | Tournament stage |
| match_info.datetime | string | Source datetime string |
| match_info.stadium | string | Stadium name |
| match_info.location | string | Stadium city and country |
| match_info.temperature | string | Source temperature string when available |
| match_info.weather | string | Weather text when available |
| match_info.viewers | string | Attendance |
| match_info.referees | list[string] | Referee crew when available |
| soccer365_url | string | Source match page |

A representative metadata file looks like this:

{
  "source_video": {
    "id": "ORzHdV_NVnQ",
    "url": "https://www.youtube.com/watch?v=ORzHdV_NVnQ",
    "title": "Argentina vs France -- 2022 FIFA World Cup Final",
    "date": "2022-12-18"
  },
  "match_info": {
    "home_team": "Argentina",
    "away_team": "France",
    "stage": "final",
    "datetime": "18.12.2022 23:59",
    "stadium": "Lusail",
    "location": "Lusail, Qatar",
    "temperature": "+35C",
    "viewers": "88,966",
    "referees": []
  },
  "soccer365_url": "https://soccer365.net/games/15292874/"
}
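Several match_info values are stored as source strings rather than typed values. The sketch below shows one way to convert two of them, assuming the formats seen in the example above: DD.MM.YYYY HH:MM for datetime and a comma-grouped number for viewers. Both helper names are hypothetical, and the source does not state a time zone, so the result is a naive datetime.

```python
from datetime import datetime


def parse_match_datetime(text):
    """Parse a source datetime such as "18.12.2022 23:59" (time zone unknown)."""
    return datetime.strptime(text, "%d.%m.%Y %H:%M")


def parse_attendance(text):
    """Convert an attendance string such as "88,966" to an int, or None if unparseable."""
    digits = text.replace(",", "").replace(" ", "")
    return int(digits) if digits.isdigit() else None


# Values copied from the representative metadata file.
match_info = {"datetime": "18.12.2022 23:59", "viewers": "88,966"}

kickoff = parse_match_datetime(match_info["datetime"])
attendance = parse_attendance(match_info["viewers"])
print(kickoff.date().isoformat(), attendance)
```

Returning None for unparseable attendance values, rather than raising, is a design choice that suits older matches where the source may expose the field in a different form or not at all.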

Match Statistics

Each stats.json file stores team-level match statistics. The exact statistic set varies by match because older source pages expose fewer fields than recent matches.

Common statistics include:

| Statistic | Description |
| --- | --- |
| Expected Goals (xG) | Expected goals when available, mainly for recent matches |
| Shots | Total shots |
| Shots on Target | Shots on target |
| Saves | Goalkeeper saves |
| Possession % | Team possession percentage |
| Corners | Corner kicks |
| Fouls | Fouls committed |
| Offsides | Offside calls |
| Yellow Cards | Yellow cards |
| Red cards | Red cards |
| Attacks | Attacking sequences from the source page |
| Dangerous Attacks | Dangerous attacks from the source page |
| Passes | Total passes |
| Pass Accuracy % | Pass accuracy percentage |
| Free Kicks | Free kicks |
| Throw-ins | Throw-ins |
| Crosses | Crosses |
| Tackles | Tackles |

A representative statistics file looks like this:

{
  "Expected Goals (xG)": {
    "Argentina": "3.3",
    "France": "2.2"
  },
  "Shots": {
    "Argentina": "21",
    "France": "10"
  },
  "Shots on Target": {
    "Argentina": "9",
    "France": "5"
  }
}
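Statistic values are stored as strings, so numeric comparisons need a conversion step. The following sketch converts the example above to floats and compares the two teams. The helper names are hypothetical, and the handling of a trailing "%" is an assumption about how percentage fields might be written, not something the example confirms.

```python
def to_number(text):
    """Convert a stat string such as "21" or "3.3" to float; return None on failure."""
    cleaned = text.strip().rstrip("%")  # trailing "%" is an assumed variant
    try:
        return float(cleaned)
    except ValueError:
        return None


def numeric_stats(stats):
    """Return the same metric -> team -> value mapping with numeric values."""
    return {
        metric: {team: to_number(value) for team, value in per_team.items()}
        for metric, per_team in stats.items()
    }


# Values copied from the representative statistics file.
sample_stats = {
    "Expected Goals (xG)": {"Argentina": "3.3", "France": "2.2"},
    "Shots": {"Argentina": "21", "France": "10"},
    "Shots on Target": {"Argentina": "9", "France": "5"},
}

numeric = numeric_stats(sample_stats)
shot_gap = numeric["Shots"]["Argentina"] - numeric["Shots"]["France"]
print(shot_gap)
```

Because the statistic set varies by match, code that reads a specific metric should check that the key exists before indexing into it.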

Event Timelines

Each events.json file stores a source-derived text timeline. Rows are ordered chronologically when the source exposes a detailed report. Older games may only include a compact list of major match events.

| Field | Type | Description |
| --- | --- | --- |
| minute | string | Match minute or source marker |
| type | string | Source event type code, such as whistle, goal, subst, yc, or rc |
| description | string | Human-readable event text |

A representative event sequence looks like this:

[
  {
    "minute": "-",
    "type": "whistle",
    "description": "The referee starts the match"
  },
  {
    "minute": "1",
    "type": "",
    "description": "Mexico kick-off, and the game is underway."
  },
  {
    "minute": "7",
    "type": "",
    "description": "Good effort by Mats Hummels as he directs a shot on target, but the keeper saves it"
  }
]

This field should be treated as a text event timeline rather than a frame-accurate official tracking feed.
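Since minute is a string that can hold a marker such as "-", filtering or counting events usually starts with a small parsing step. The sketch below is one way to do this on the example above. The stoppage-time form "45+2" is an assumed format that the example does not confirm, the label "untyped" for empty type codes is purely illustrative, and both helper names are hypothetical.

```python
from collections import Counter


def parse_minute(text):
    """Return the match minute as an int, or None for markers such as "-"."""
    base, _, added = text.partition("+")  # "45+2" is an assumed stoppage-time form
    if not base.strip().isdigit():
        return None
    total = int(base)
    if added.strip().isdigit():
        total += int(added)
    return total


def events_between(events, start, end):
    """Keep events whose parsed minute lies in the inclusive range [start, end]."""
    selected = []
    for event in events:
        minute = parse_minute(event["minute"])
        if minute is not None and start <= minute <= end:
            selected.append(event)
    return selected


# Rows copied from the representative event sequence.
sample_events = [
    {"minute": "-", "type": "whistle", "description": "The referee starts the match"},
    {"minute": "1", "type": "", "description": "Mexico kick-off, and the game is underway."},
    {"minute": "7", "type": "", "description": "Good effort by Mats Hummels as he directs a shot on target, but the keeper saves it"},
]

type_counts = Counter(event["type"] or "untyped" for event in sample_events)
first_five_minutes = events_between(sample_events, 1, 5)
print(dict(type_counts), len(first_five_minutes))
```

The parsed minute is a match-clock value from the text report, so mapping it to a position in the video still requires a separate alignment step.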

Lineups and Prediction Polls

Each lineups.json file stores starting players, substitutes, and formations.

| Field | Type | Description |
| --- | --- | --- |
| <team>.starting | list[object] | Starting XI players |
| <team>.substitutes | list[object] | Substitute bench players |
| number | string | Shirt number |
| name | string | Player name |
| formation.<team> | string | Formation, such as 4-3-3 |
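The field table suggests a layout with one top-level key per team plus a formation key. The sketch below summarizes a file under that assumed layout. The sample data is entirely made up (placeholder player names, shirt numbers, and formations), so the exact nesting should be checked against a real lineups.json before relying on it; both helper names are hypothetical.

```python
def formation_lines(formation):
    """Split a formation string such as "4-3-3" into per-line player counts."""
    return [int(part) for part in formation.split("-")]


def squad_summary(lineups):
    """Count starters and substitutes per team and attach each team's formation."""
    formations = lineups.get("formation", {})
    summary = {}
    for team, squad in lineups.items():
        if team == "formation":  # assumed to sit next to the team keys
            continue
        summary[team] = {
            "formation": formations.get(team),
            "starters": len(squad.get("starting", [])),
            "substitutes": len(squad.get("substitutes", [])),
        }
    return summary


# Placeholder data only: players, numbers, and formations are invented for the demo.
sample_lineups = {
    "Team A": {
        "starting": [{"number": "1", "name": "Player A"}, {"number": "2", "name": "Player B"}],
        "substitutes": [{"number": "12", "name": "Player C"}],
    },
    "Team B": {
        "starting": [{"number": "1", "name": "Player D"}],
        "substitutes": [],
    },
    "formation": {"Team A": "4-3-3", "Team B": "4-4-2"},
}

summary = squad_summary(sample_lineups)
print(summary["Team A"], formation_lines("4-3-3"))
```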

Each prediction.json file stores the source page’s pre-match user prediction poll, with the vote percentage and vote count for each outcome: a win for either team, or a draw.

{
  "Argentina win": {
    "percentage": "36",
    "votes": "569"
  },
  "draw": {
    "percentage": "34",
    "votes": "525"
  },
  "France win": {
    "percentage": "30",
    "votes": "472"
  }
}
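The stored percentages can be cross-checked against the vote counts, which also yields unrounded shares for downstream use. The sketch below does this for the example above. The helper name vote_shares is hypothetical, and stripping thousands separators is a precaution for larger polls rather than something the example requires.

```python
def vote_shares(prediction):
    """Return each outcome's share of the total votes as a fraction in [0, 1]."""
    votes = {
        outcome: int(entry["votes"].replace(",", ""))  # separators are an assumed variant
        for outcome, entry in prediction.items()
    }
    total = sum(votes.values())
    if total == 0:
        return {outcome: 0.0 for outcome in votes}
    return {outcome: count / total for outcome, count in votes.items()}


# Values copied from the representative prediction file.
sample_prediction = {
    "Argentina win": {"percentage": "36", "votes": "569"},
    "draw": {"percentage": "34", "votes": "525"},
    "France win": {"percentage": "30", "votes": "472"},
}

shares = vote_shares(sample_prediction)
for outcome, share in shares.items():
    stored = int(sample_prediction[outcome]["percentage"])
    print(outcome, round(share * 100), stored)
```

For this example the recomputed, rounded shares agree with the stored percentages.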

Construction Pipeline

The dataset is built through a multi-stage crawling and cleaning pipeline that emphasizes source traceability and match-level correctness.

1. Playlist Extraction

The pipeline starts from the YouTube FIFA World Cup full-match playlist PLCGIzmTE4d0jq6wHT2TvSspZ_HLiIx4_y. Video IDs, URLs, titles, and team/year signals are extracted from the playlist.

2. Match Normalization

Team names and World Cup editions are parsed from YouTube titles. Aliases such as Korea Republic are normalized to South Korea, and editions such as 2022 Qatar, 2018 Russia, and 2002 Korea/Japan are mapped into consistent labels.
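A minimal sketch of this normalization step follows. It is not the pipeline's actual code: the regular expressions are assumptions fitted to the title format shown in the meta.jsonl example, the alias table holds only the one alias named above, and the second demo title is hypothetical. The edition labels are the seven listed under Edition Coverage.

```python
import re

# Only the alias named in the text; a real pipeline would likely need more.
TEAM_ALIASES = {"Korea Republic": "South Korea"}

EDITION_LABELS = {
    1998: "1998 France",
    2002: "2002 Korea/Japan",
    2006: "2006 Germany",
    2010: "2010 South Africa",
    2014: "2014 Brazil",
    2018: "2018 Russia",
    2022: "2022 Qatar",
}

# Assumed title shape: optional "FULL MATCH:" prefix, "A v B" or "A vs B", then "|" or "--".
TEAMS_PATTERN = re.compile(
    r"(?:FULL MATCH:\s*)?(?P<a>.+?)\s+(?:vs\.?|v)\s+(?P<b>.+?)\s*(?:\||--|$)",
    re.IGNORECASE,
)
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def normalize_team(name):
    """Trim a parsed team name and map known aliases to their canonical form."""
    name = name.strip()
    return TEAM_ALIASES.get(name, name)


def parse_title(title):
    """Extract normalized team names and the edition label from a video title."""
    teams_match = TEAMS_PATTERN.match(title.strip())
    year_match = YEAR_PATTERN.search(title)
    teams = []
    if teams_match:
        teams = [normalize_team(teams_match.group(key)) for key in ("a", "b")]
    world_cup = EDITION_LABELS.get(int(year_match.group())) if year_match else None
    return {"teams": teams, "world_cup": world_cup}


print(parse_title("FULL MATCH: Brazil v Croatia | Quarter-Finals | FIFA WORLD CUP QATAR 2022"))
# Hypothetical title, used only to exercise the alias mapping.
print(parse_title("Korea Republic v Germany | Group Stage | 2018 FIFA World Cup"))
```

Titles that do not fit the assumed shape produce an empty team list or a None edition, which makes them easy to route to manual review.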

3. External Match Linking

Each video record is linked to a Soccer365 match page when a confident match page can be identified.

4. Structured Data Crawling

For linked matches, structured match metadata, stadium information, attendance, referees, lineups, formations, team statistics, text event timelines, and prediction poll results are crawled.

5. Folder-Level Packaging

Structured data is stored in one folder per match using stable, human-readable folder names.

6. Final Validation

Each structured match is checked for file coverage. Every retained match folder includes metadata.json, stats.json, events.json, lineups.json, and prediction.json.
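A file-coverage check of this kind can be reproduced locally. The sketch below is an illustration rather than the authors' validation script: missing_files is a hypothetical helper, and the demo builds a throwaway games/ directory with one deliberately incomplete folder so that it runs without the dataset.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = (
    "metadata.json",
    "stats.json",
    "events.json",
    "lineups.json",
    "prediction.json",
)


def missing_files(games_dir):
    """Map each incomplete match folder name to the required files it lacks."""
    report = {}
    for match_dir in sorted(Path(games_dir).iterdir()):
        if not match_dir.is_dir():
            continue
        missing = [name for name in REQUIRED_FILES if not (match_dir / name).is_file()]
        if missing:
            report[match_dir.name] = missing
    return report


# Demo: one complete folder and one folder without prediction.json.
with tempfile.TemporaryDirectory() as tmp:
    games = Path(tmp) / "games"
    complete = games / "2022-final-Argentina-vs-France"
    incomplete = games / "2018-group-stage-Germany-vs-Mexico"
    for folder in (complete, incomplete):
        folder.mkdir(parents=True)
        for name in REQUIRED_FILES:
            (folder / name).write_text("{}", encoding="utf-8")
    (incomplete / "prediction.json").unlink()
    demo_report = missing_files(games)

print(demo_report)
```

An empty report means every match folder contains all five files.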

Quick Start

Install the basic dependencies:

pip install datasets huggingface_hub pandas yt-dlp

Load the top-level metadata from Hugging Face:

from datasets import load_dataset

repo_id = "choucsan/FIFA_World_Cup_Games"
dataset = load_dataset(repo_id, data_files="meta.jsonl")
records = dataset["train"]

print(records[0])

Read one structured match locally:

import json
from pathlib import Path

game_dir = Path("games/2022-final-Argentina-vs-France")

def load_json(path):
    # Use a context manager so each file handle is closed after reading.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

metadata = load_json(game_dir / "metadata.json")
stats = load_json(game_dir / "stats.json")
events = load_json(game_dir / "events.json")
lineups = load_json(game_dir / "lineups.json")
prediction = load_json(game_dir / "prediction.json")

print(metadata["source_video"]["url"])
print(stats["Shots"])
print(events[:3])
print(lineups["formation"])
print(prediction)
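Both meta.jsonl (video_id) and each metadata.json (source_video.id) carry the YouTube video ID, so a video record can be mapped to its match folder. The sketch below builds that lookup, assuming the two IDs agree for the same match. The helper name is hypothetical, and the demo uses a temporary directory holding a trimmed copy of the metadata example.

```python
import json
import tempfile
from pathlib import Path


def index_folders_by_video_id(games_dir):
    """Map source_video.id from every metadata.json to its match folder."""
    index = {}
    for metadata_path in sorted(Path(games_dir).glob("*/metadata.json")):
        with open(metadata_path, encoding="utf-8") as handle:
            metadata = json.load(handle)
        index[metadata["source_video"]["id"]] = metadata_path.parent
    return index


# Demo on a throwaway games/ directory with a single match folder.
with tempfile.TemporaryDirectory() as tmp:
    match_dir = Path(tmp) / "games" / "2022-final-Argentina-vs-France"
    match_dir.mkdir(parents=True)
    (match_dir / "metadata.json").write_text(
        json.dumps({"source_video": {"id": "ORzHdV_NVnQ"}}), encoding="utf-8"
    )
    folder_index = index_folders_by_video_id(Path(tmp) / "games")
    demo_folder_name = folder_index["ORzHdV_NVnQ"].name

print(demo_folder_name)
```

With the real repository, call index_folders_by_video_id("games") from the dataset root and look up the video_id of any meta.jsonl record.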

Download a YouTube video independently when permitted:

yt-dlp -f "bv*+ba/b" \
  -o "games/2022-final-Argentina-vs-France/video/%(id)s.%(ext)s" \
  "https://www.youtube.com/watch?v=ORzHdV_NVnQ"

Please ensure that downloading and using videos complies with YouTube’s Terms of Service, copyright rules, and your local research policies.

Applications

FIFA World Cup Games can be used in several research and development settings.

| Application | How FIFA World Cup Games Helps |
| --- | --- |
| Long video understanding | Supports reasoning over 90+ minute football broadcasts, extra time, and penalty shootouts |
| Event localization | Links textual event timelines to goals, substitutions, cards, saves, attacks, and match phases |
| Visual retrieval | Enables retrieval of goals, corners, fouls, substitutions, and tactical sequences from text queries |
| Temporal prediction | Supports next-event prediction, momentum modeling, score progression, and pre-match expectation analysis |
| Multimodal QA | Combines video evidence, statistics, lineups, formations, event text, venues, and tournament context |
| Football agents | Provides long-match context for agents that can search matches, answer tactical questions, summarize event flows, and generate scouting-style reports |
| AI referee research | Enables experiments on foul/card understanding, incident verification, referee-support workflows, and rule-grounded decision assistance |
| Automated highlights | Supports detection of goals, red cards, late winners, penalty shootouts, comebacks, and iconic World Cup moments |
| Sports analytics | Supports formation analysis, team-level statistical comparison, edition-level trends, and historical match comparison |

Citation and Contact

If FIFA World Cup Games helps your work, please consider linking back to the dataset page. For questions, corrections, or collaboration, contact choucisan@gmail.com.