Milestone Systems’ Hafnia brings two custom vision language models (VLMs) specialized for smart city video intelligence.
Cosmos Reason is an open VLM that excels at physical understanding for autonomous vehicle (AV), robotics, smart city, and industrial applications. For smart city applications, a general-purpose model has limitations because each city is unique: it may not understand city-specific visuals, languages, symbols, events, weather, lighting, and more.
To overcome this challenge, Milestone Hafnia used ~150,000 hours of video (~75,000 hours from the US and ~75,000 hours from Europe) from its data library and applied supervised fine-tuning (SFT) to post-train NVIDIA Cosmos Reason, creating two specialized models: a US VLM and an EU VLM.
In this post, we walk through the process of baselining Cosmos Reason performance and evaluating the new specialized models’ performance on smart-city camera views. This includes understanding traffic flow directions, time of day, weather, and road-surface conditions, as well as their use in applications such as validating alerts, generating reports, and summarizing key events.
Figure 1: Autonomous vehicle ego views vs. smart-city camera views
Post-training Cosmos Reason focused on improving understanding in four primary areas:
Table 1: VLM Domain Gaps and Their Effect on VLM Behavior
Milestone Hafnia curated 150,000 hours of ~30-second clips as training data with NVIDIA Cosmos Curator. The US data is stored on AWS, and model training was completed on Nebius. Similarly, the EU data is stored and the EU model is trained in the EU on Nebius to comply with EU data sovereignty requirements.
Workflow (NVIDIA Cosmos Curator + Hafnia)
The AI model data curation workflow begins with ingesting and filtering raw data. This is followed by within-city balanced sampling to ensure a representative dataset. Next, the data is processed by annotating concise rationales and final answers. To maintain data quality, schema validators are applied for alerts, reports, and summaries. Finally, hard-negative mining is performed to further refine the dataset.
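To make the stages above concrete, here is a minimal sketch of that kind of curation loop in Python. It is an illustration only: the function, the clip-dictionary fields, and the per-city quota are our own placeholders, not Cosmos Curator or Hafnia APIs.

```python
import random
from collections import defaultdict

def curate(clips, per_city_quota=500):
    """Simplified outline of the curation stages: filter, balance, annotate, validate."""
    # 1. Ingest and filter raw clips (e.g., drop corrupt or too-short footage).
    usable = [c for c in clips if c["duration_s"] >= 25 and not c.get("corrupt", False)]

    # 2. Within-city balanced sampling for a representative dataset.
    by_city = defaultdict(list)
    for c in usable:
        by_city[c["city"]].append(c)
    balanced = [c for city_clips in by_city.values()
                for c in random.sample(city_clips, min(per_city_quota, len(city_clips)))]

    # 3. Annotate concise rationales and final answers (annotation backend omitted here).
    annotated = [dict(c, rationale="...", answer="...") for c in balanced]

    # 4. Schema validation for alerts, reports, and summaries.
    valid = [c for c in annotated if {"rationale", "answer"} <= c.keys()]

    # 5. Hard-negative mining would further refine the pool (omitted in this sketch).
    return valid
```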
SFT shards (same pool)
- Visual/domain coverage: flows (merge/lane/ped), markings, condition diversity (time-of-day, weather, road state).
- Contract focus: instruction-oriented prompts to stabilize output formats.
Privacy
Brighter AI anonymization, full data lineage, and GDPR/CCPA compliance; EU data stays in the EU.
Figure 2: Hafnia & NVIDIA Cosmos curation pipelines
Table 2: Breakdown of curated training dataset
Chosen approach: two-step SFT
We applied visual/domain post-training and output-format adaptation as two separate passes. This reduced complexity, kept learning stable, and produced consistent formatting while improving flow reasoning.
Using the Cosmos Reason SFT code, we applied different configurations, varying visual resolution, temporal resolution, or both, until we found the sweet spot for each domain adaptation.
The SFT dataset was pulled together in different shapes and forms, from general question-and-answer (summary) formats to task-specific prompt/answer pairs, plus instruction datasets to reinforce the output formatting.
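To illustrate what those "shapes" might look like, here are three hypothetical training samples. The file paths, field names, and answers are made up for illustration and do not reflect the exact Hafnia schema.

```python
# Illustrative SFT samples in three shapes; all values are placeholders.
summary_sample = {
    "clip": "clips/us/cam_042_0001.mp4",  # hypothetical path
    "prompt": "Summarize the traffic scene in this clip.",
    "answer": "Moderate traffic merging left; wet road surface; dusk lighting.",
}

task_sample = {
    "clip": "clips/eu/cam_117_0093.mp4",  # hypothetical path
    "prompt": "Is the upstream 'stalled vehicle' alert valid? Think briefly, then answer in JSON.",
    "answer": '{"alert_valid": true, "notes": "Vehicle stationary in lane 2 for over 30 s."}',
}

instruction_sample = {
    "clip": "clips/us/cam_042_0002.mp4",  # hypothetical path
    "prompt": "Output only the JSON schema; no extra text.",
    "answer": '{"camera_id": "cam_042", "events": []}',
}
```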
Example of a representative configuration (a config sketch follows the list):
- Ingest: ~30 s clips, 2 fps (training).
- Context: model_max_length=12288.
- Optimizer: AdamW.
- Batching: train_batch_per_replica=6, mini_batch=1, 1 epoch.
- Dtypes/parallelism: master=float32, param=bfloat16, FSDP defaults (no offload).
- Output schema targets: AlertValidation, MonitoringSummary, IncidentReport.
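For reference, the settings above map onto a training configuration roughly like the sketch below. The values mirror the list; the exact structure and key names of the Cosmos Reason SFT config may differ.

```python
# Representative SFT configuration (sketch; exact config layout may differ).
sft_config = {
    "data": {
        "clip_length_s": 30,   # ~30 s clips
        "fps": 2,              # 2 fps sampling during training
    },
    "model_max_length": 12288,
    "optimizer": "AdamW",
    "train_batch_per_replica": 6,
    "mini_batch": 1,
    "epochs": 1,
    "master_dtype": "float32",
    "param_dtype": "bfloat16",
    "parallelism": "FSDP defaults, no offload",
    "output_schemas": ["AlertValidation", "MonitoringSummary", "IncidentReport"],
}
```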
Infrastructure & stability
We ran full supervised fine-tuning on 8×H100 GPUs, with ~50–60 experiments per model and 7–8 seconds per step in the training loop.
Figure 3: Hafnia & NVIDIA post-training and testing flow
Next, we adjusted the configuration of the different annotation shards, that is, the combinations of prompts and answers/captions included with the clips across the different tuning experiments.
The goal was to make outputs machine-consumable for operations systems (alert manager, reporting) while preserving concise reasoning. We standardized prompts and enforced format-bound answers with frame-aligned timestamps.
Prompt patterns(*):
- System: “You are a CCTV traffic analyst. Validate upstream alerts. Think briefly, then output in JSON format.”
- Alert validation (parent–child)
 - Answer JSON fields: camera_id, time_range, conditions{weather, light, visibility, road_state}, events[], notes
- Scheduled monitoring (hourly/daily/custom)
- Report automation (incident drafts)
 - Answer JSON fields: incident_type, timeline{start, end, key_moments[]}, parties[], location_hint, cause_hypothesis, evidence_frames[]
(*) In these examples we simplify to JSON; in practice the exact output format was defined in the prompts.
Output schemas:
- AlertValidation v1
- MonitoringSummary v1
- IncidentReport v1
- Prompting: always require “the specific schema only; no extra text”; keep <think> brief to avoid style overfitting. As noted above, this is a simplification; the actual format was defined in the prompts.
- Validators in loop: format parsing, required-field checks, enumeration validation, and timestamp alignment (±N frames); a minimal validator sketch follows this list.
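Here is a minimal sketch of such an in-loop validator, assuming a JSON AlertValidation output. The required fields come from the schema above; the weather enumeration, the frame tolerance N, and the function name are illustrative placeholders.

```python
import json

REQUIRED_FIELDS = {"camera_id", "time_range", "conditions", "events", "notes"}
WEATHER_ENUM = {"clear", "rain", "snow", "fog"}  # illustrative enumeration
TIMESTAMP_TOLERANCE_FRAMES = 2                   # "±N frames"; N is a placeholder

def validate_alert_validation(raw_output: str,
                              alert_time_s: float | None = None,
                              fps: float = 2.0) -> list[str]:
    """Return a list of validation errors; an empty list means the sample passes."""
    # 1. Format parsing: the answer must be valid JSON with no extra text.
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    errors = []

    # 2. Required-field checks.
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")

    # 3. Enumeration validation (weather shown as an example).
    conditions = payload.get("conditions")
    if isinstance(conditions, dict):
        weather = conditions.get("weather")
        if weather is not None and weather not in WEATHER_ENUM:
            errors.append(f"unknown weather value: {weather!r}")

    # 4. Timestamp alignment: the reported start should sit within ±N frames
    #    of the upstream alert's timestamp.
    time_range = payload.get("time_range")
    if alert_time_s is not None and isinstance(time_range, dict):
        start_s = time_range.get("start")
        if isinstance(start_s, (int, float)):
            drift_frames = abs(start_s - alert_time_s) * fps
            if drift_frames > TIMESTAMP_TOLERANCE_FRAMES:
                errors.append(f"start {start_s}s is {drift_frames:.1f} frames from the alert")

    return errors
```

In a setup like this, samples that fail any check could be re-annotated or dropped before the next tuning pass.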
Operator terminology. Normalize entities (e.g., stalled_vehicle, merge_left, crosswalk_occupied) and map to specific taxonomies.
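As a trivial illustration of that normalization step, assuming a flat synonym map (the phrases on the left are examples, not the full taxonomy):

```python
# Map free-form operator phrasing to canonical taxonomy entities (illustrative subset).
ENTITY_TAXONOMY = {
    "stopped car": "stalled_vehicle",
    "broken down vehicle": "stalled_vehicle",
    "merging left": "merge_left",
    "pedestrian in crosswalk": "crosswalk_occupied",
}

def normalize_entity(term: str) -> str:
    """Return the canonical entity name, or a cleaned fallback if the term is unknown."""
    key = term.strip().lower()
    return ENTITY_TAXONOMY.get(key, key.replace(" ", "_"))
```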
This approach worked: it delivered simple output formats and concise rationales, yielding parseable, consistent outputs without sacrificing reasoning quality.
We evaluated both the baseline NVIDIA Cosmos Reason and our SFT model on a representative sample of smart city cameras, including diverse scenes and challenging edge cases common in this context (dusk glare, heavy rain, occlusions, complex merges). As mentioned before, the baseline model is already a top performer on the “Physical Reasoning from Video” leaderboard on Hugging Face, so the goal was to test whether domain-specific fine-tuning could push performance beyond an already strong baseline, or show no material gain.
Here’s a summary of our post-training results:
- Impressive baseline performance with no domain supervision. Not surprisingly, Cosmos Reason’s understanding was strong out of the box, which is particularly notable given the influence of ego-view (AV/robotics) applications on its training.
- Largest gains come from visual domain transfer. The +19.4% jump in traffic flow/direction understanding and +8.9% in condition identification indicate that post-training on city-specific visuals (within-city balance across scenes, lighting, and road states) substantially improves visual understanding from static viewpoints.
- Strong functional improvements. Alert-verification accuracy rose +4.4%, which supports real-world alert triage and a reduction in false positives. Formatting already had high baseline performance, and the fine-tuned model boosted validity, completeness, and ordering by ~+0.6% on average. These improvements have a meaningful impact on downstream automation tasks that depend on specific schemas.
In summary, beginning with a robust Physical-AI VLM and applying two-step SFT (first for general visual adaptation, then for output-focused objectives) significantly improves CCTV tasks such as traffic analysis, condition assessment, and clean schema generation.
Conclusion
With the two-step SFT process, the specialized US VLM and EU VLM reliably handle traffic flows, condition classification, and output-format alignment for operations.
Next, we plan to continually refresh the training content with Curator, improve model performance, add guardrails and Q&A chat windows, and explore model distillation for edge deployments.
Integrate this new VLM with XProtect or apply it to your own feeds; contact us for Hafnia VLM-as-a-Service access.
We are inviting 100 developers and companies to participate in the Hafnia Hackathon 2025, in collaboration with Milestone Systems and NVIDIA, to build integrations with other applications using Hafnia’s vision language models, from real-time analytics to smart city solutions and beyond.
Finalists’ submissions will be presented at the Milestone Dev Summit in Copenhagen (Nov 10–11), where the winners will be announced.
Find out more about the hackathon and its prizes here: hafnia.milestonesys.com/hackathon
Authors: Fulgencio Navarro (FUNA), Danilo Dresen (DAND), Shriram Arunachalam Muthuvezhappan (SARU) and Edward Mauser (EDM)