From frontier model to trusted real-world video intelligence: Milestone Hafnia and NVIDIA Cosmos 3

Article

June 01, 2026

In smart cities, traffic operations, and transportation hubs, video AI does not just observe — it informs decisions. Which lane of the road to close. Whether to dispatch emergency services. Whether an alert is real or a false positive. These decisions have consequences, and the AI behind them needs to be trustworthy — meaning accurate, consistent, and free of hallucination.

General-purpose Vision Language Models (VLMs) have improved dramatically. But there is a significant gap between a model that can describe a video and one that can reliably support operational decisions. That gap is what Hafnia — and our work with NVIDIA Cosmos — is designed to close.

Milestone Hafnia is an initiative to build the data infrastructure, validation pipelines, and fine-tuning capability needed to turn frontier world models into operational tools. We have fine-tuned NVIDIA Cosmos to create validated, domain-specific VLMs. These specialized VLMs power XProtect Video Summarization, and are accessible to third-party integrations through our VLM-as-a-Service (VLMaaS) solution.

NVIDIA Cosmos 3 is a new world foundation model that combines vision reasoning and multimodal generation across text, video, images, ambient sound and action in a single model to help developers create world data with physical context.

This post shares where Cosmos 3 performs strongly, and where domain-specific fine-tuning using Hafnia’s Data Library closes the gap for customers handling smart cities, traffic, transportation, and safety use cases.

Why operational video needs more than scene description

Describing a video is not the same as reasoning about what is happening, why it matters, and what a system downstream should do about it. In a traffic or smart city environment, a model supporting real operations needs to distinguish:

a vehicle stopped because of congestion and one stopped because of an incident

a pedestrian waiting safely on the sidewalk and one stepping into the road

a legal turn and a risky or illegal maneuver

wet pavement, ice, fog, glare, or other environmental conditions

a real alert and a false positive caused by camera angle, occlusion, or scene complexity

These are not object detection tasks. They require spatial understanding, temporal reasoning, and domain context. A model that gets these wrong is not just imprecise — it is unreliable.

And in environments where video AI informs real decisions, unreliable is not acceptable.

The foundation of Hafnia: real-world data at operational scale

The Hafnia Data Library brings together millions of hours of real-world video contributed through a compliant legal and technical framework. Unlike public datasets, the library captures the edge cases that matter most in production environments.

The dataset reflects operational diversity: camera angles, weather, traffic patterns, geographies, lighting, and infrastructure.

Designed for compliance

All footage is anonymized using Brighter AI and processed under a framework designed to align with GDPR and the EU AI Act, with auditable data lineage. Compliance is not a constraint on what we can do — it is part of what makes the data trustworthy enough to build production AI on top of. Within the pipeline, footage is curated and annotated through a semi-automatic workflow combining automated model pipelines with human-in-the-loop review, supported by NVIDIA Cosmos Curator.

The result is rich, domain-specific annotation: scene interpretation, event classification, structured labels, and the metadata that makes evaluation meaningful. This matters because it means models are validated against the same scenarios Milestone customers and ecosystem partners actually use — not simplified scenes or generic public benchmarks.

How we validate: three levels of real-world readiness

Not all model capability is equally valuable in production. A model that excels at weather classification but hallucinates during incident detection is not ready for deployment. Hafnia's validation framework is designed to assess readiness across three levels, using subsets of the Data Library focused on smart city and traffic scenarios.

Each level reflects a higher bar for operational trustworthiness:

Level one — recognition and classification: weather classification, road condition assessment, traffic state classification, and binary incident detection. The model needs to get the basics right before anything else matters.

Level two — reasoning: traffic risk assessment, interpretation of legal versus risky maneuvers, and understanding interactions between vehicles, pedestrians, and infrastructure. This is where general-purpose models tend to struggle most.

Level three — structured output: accident reports, alert validation, video summaries, and JSON-formatted outputs ready for integration into downstream systems. This is the deployment test: can the model produce outputs that a real system can actually consume?

A model that passes all three levels is not just capable — it is integrable. That means its outputs slot directly into real systems, APIs, and workflows without manual cleanup. That distinction is what separates a promising foundation model from something you can actually deploy.
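One way to picture the three-level bar is as an ordered gate, where a model is only called integrable once every level clears its threshold. The sketch below is illustrative only: the `LevelResult` type, the scores, and the thresholds are hypothetical, and treating the levels as a strict sequence is an assumption rather than a description of Hafnia's actual framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelResult:
    """Outcome of one validation level (hypothetical structure)."""
    name: str
    score: float      # aggregate score in [0, 1]; how it is computed is out of scope
    threshold: float  # hypothetical pass bar for this level

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def readiness(levels: list[LevelResult]) -> str:
    """Walk the levels in order and report the first one that fails."""
    for level in levels:
        if not level.passed:
            return f"blocked at {level.name}"
    return "integrable"


# Made-up numbers, purely to show the gating behaviour.
candidate = [
    LevelResult("recognition and classification", score=0.95, threshold=0.90),
    LevelResult("reasoning", score=0.78, threshold=0.85),
    LevelResult("structured output", score=0.97, threshold=0.95),
]
print(readiness(candidate))
```

In this made-up case the model is held back at the reasoning level even though its structured output score is high, which mirrors the point that capability at one level does not compensate for a gap at another.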

Measurement combines quantitative metrics (accuracy, F1, hallucination rate, JSON validity, format adherence) with qualitative review (human scoring, LLM-as-a-judge, domain-specific error analysis). Weather classification has a clear ground truth. Accident report generation has to be judged on whether the model identified the right event, sequenced it correctly, grounded its claims in visible evidence, and followed the requested format.
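As a rough illustration of the quantitative side, the sketch below computes three of the metrics named above using only the Python standard library. The function names and sample data are assumptions made for this example, not Hafnia's evaluation code. Hallucination rate is left out because its exact definition is not given here.

```python
import json
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the ground-truth label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 for a binary task such as incident versus no incident."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def json_validity_rate(raw_outputs: Sequence[str]) -> float:
    """Share of raw model outputs that parse as a JSON object."""
    def is_object(text: str) -> bool:
        try:
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False

    if not raw_outputs:
        raise ValueError("no outputs to score")
    return sum(is_object(o) for o in raw_outputs) / len(raw_outputs)


# Made-up sample data for illustration.
weather_true = ["rain", "fog", "clear", "rain"]
weather_pred = ["rain", "clear", "clear", "rain"]
incident_true = [True, False, True, True]
incident_pred = [True, True, False, True]
outputs = ['{"event": "collision"}', "The video shows a collision.", "[1, 2]"]

print(accuracy(weather_true, weather_pred))
print(binary_f1(incident_true, incident_pred))
print(json_validity_rate(outputs))
```

Metrics like these cover tasks with a clear ground truth. Generated reports still need the qualitative review described above, because a parseable, well-formatted output can describe the wrong event.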

Why NVIDIA Cosmos 3 stands out for operational video

NVIDIA Cosmos 3 was built for physical AI — models that need to understand the real world, not just categorize it. Its architecture is designed for spatial and temporal reasoning, which is precisely what operational video demands. Our early-access validation confirms several areas where this design intent translates directly into production-relevant capability.

Strong spatial and temporal reasoning about traffic context. Cosmos 3 performs well when assessing whether a maneuver or situation is unsafe, and explains its assessment in terms of the specific actors and conditions in the scene — not generic safety language.

Improved hallucination robustness. In high-stakes environments, a model that invents details is worse than no model at all — it generates false confidence. Cosmos 3 shows stronger evidence-grounding than comparable general-purpose models, particularly when prompts are structured to require visible evidence.

Reliable understanding of weather and environmental conditions. The model classifies rain, fog, road wetness, and other visible conditions reliably — a foundational requirement for any traffic or smart city deployment.

Accurate structured output. Cosmos 3 follows requested JSON and structured schemas closely, even for complex outputs. This is critical for integration into Hafnia VLM-as-a-Service, where model outputs feed directly into alerts, dashboards, and downstream workflows — and need to be machine-readable, not just human-readable.
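To make "machine-readable" concrete, here is a minimal sketch of the kind of check a downstream system might run before accepting a model response. The schema, its field names, and the `check_output` helper are hypothetical and chosen for illustration; they are not the Hafnia VLM-as-a-Service contract.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
ACCIDENT_REPORT_SCHEMA = {
    "incident_detected": bool,
    "event_type": str,
    "vehicles_involved": int,
    "evidence": list,
}


def check_output(raw: str, schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field, expected in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it for int fields.
        elif not isinstance(data[field], expected) or (
            expected is int and isinstance(data[field], bool)
        ):
            problems.append(f"wrong type for {field}")
    for field in sorted(data.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


good = json.dumps({
    "incident_detected": True,
    "event_type": "rear-end collision",
    "vehicles_involved": 2,
    "evidence": ["stationary vehicles in lane 2", "visible debris"],
})
bad = '{"incident_detected": "yes", "event_type": "collision"}'

print(check_output(good, ACCIDENT_REPORT_SCHEMA))
print(check_output(bad, ACCIDENT_REPORT_SCHEMA))
```

A check like this only verifies shape. Whether the reported event actually happened in the video is a separate question that the hallucination and evidence-grounding evaluation addresses.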

Where domain fine-tuning improves performance

No frontier model arrives ready for every operational context. Validation is not about finding fault. It is about identifying exactly where domain-specific fine-tuning with the Hafnia Data Library creates the most meaningful improvement. In this early phase, three areas stood out.

Relative object position. Distinguishing whether a pedestrian is on the sidewalk, entering the road, or at a crosswalk is subtle, but it is precisely the kind of distinction that determines whether an alert fires or not. This is a natural target for fine-tuning with Hafnia's annotated pedestrian-interaction data.

Temporal direction. Whether a vehicle is approaching, turning, or changing lanes over time is difficult for models trained on general video. Domain-specific fine-tuning with real traffic footage, annotated for movement trajectories, directly addresses this.

Counting in dense scenes. Heavy traffic, crowded intersections, and transportation hubs push models beyond simple detection. Accurate counting in cluttered, dynamic scenes is one of the highest-value improvements that domain fine-tuning can deliver.

These gaps are the signal. They tell us exactly what to fine-tune, and what customers can trust in production.

From validated model to specialized VLM

The goal of Hafnia is not to evaluate models for evaluation's sake. It is to build a pipeline that continuously turns the best available frontier models into production-ready specialized VLMs, available through our VLM-as-a-Service solution.

As new frontier models emerge, Hafnia validates them against real operational video. The strongest candidates are fine-tuned using Hafnia's curated and annotated datasets. The result is a growing portfolio of specialized VLMs, each adapted to a specific domain: traffic intelligence, smart cities, transportation hubs, safety.

For customers and solution builders, this means AI built for the environments they deploy into. For model developers and ecosystem partners, it means visibility into how a frontier model performs on real operational video, with compliant data pipelines and domain-specific evaluation that generic benchmarks cannot provide.

What comes next

The first VLMs built on Cosmos 3 are coming.

The next phase of evaluation will go deeper into accident reporting, alert confirmation, pedestrian behavior, crowd flow, and transportation hub scenarios. As these evaluations continue, Milestone Hafnia will keep expanding its portfolio of specialized VLMs for the video technology ecosystem.

Follow Hafnia for updates as we release new VLMs and share deeper evaluation findings.
