Milestone Systems’ Hafnia brings two custom vision language models (VLMs) specialized for smart city video intelligence.
Cosmos Reason is an open VLM that excels at physical understanding for autonomous vehicle (AV), robotics, smart city, and industrial applications. For smart city applications, a general-purpose model has limitations because each city is unique: it may not understand city-specific visuals, languages, symbols, events, weather, lighting, and more.
To overcome this challenge, Milestone Hafnia used ~150,000 hours of video (~75,000 hours from the US and ~75,000 hours from Europe) from its data library and applied supervised fine-tuning (SFT) to post-train NVIDIA Cosmos Reason, creating two specialized models: a US VLM and an EU VLM.
In this post, we walk through the process of baselining Cosmos Reason performance and evaluating the new specialized models’ performance on smart-city camera views. This includes understanding traffic flow directions, time of day, weather, and road-surface conditions, as well as their use in applications such as validating alerts, generating reports, and summarizing key events.
Figure 1: Autonomous vehicle ego views vs. smart-city camera views
Post-training Cosmos Reason focused on improving understanding in four primary areas:
Table 1: VLM Domain Gaps and Their Effect on VLM Behavior
Milestone Hafnia curated 150,000 hours of ~30-second clips as training data with NVIDIA Cosmos Curator. The US data is stored on AWS, and model training was completed on Nebius. Similarly, the EU data is stored and the EU model is trained in the EU on Nebius to comply with EU data sovereignty requirements.
Workflow (NVIDIA Cosmos Curator + Hafnia)
The AI model data curation workflow begins with ingesting and filtering raw data. This is followed by within-city balanced sampling to ensure a representative dataset. Next, the data is processed by annotating concise rationales and final answers. To maintain data quality, schema validators are applied for alerts, reports, and summaries. Finally, hard-negative mining is performed to further refine the dataset.
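To make the stages above concrete, here is a minimal sketch of that kind of curation loop in Python. It is an illustration only: the function, the clip-dictionary fields, and the per-city quota are our own placeholders, not Cosmos Curator or Hafnia APIs.

```python
import random
from collections import defaultdict

def curate(clips, per_city_quota=500):
    """Simplified outline of the curation stages: filter, balance, annotate, validate."""
    # 1. Ingest and filter raw clips (e.g., drop corrupt or too-short footage).
    usable = [c for c in clips if c["duration_s"] >= 25 and not c.get("corrupt", False)]

    # 2. Within-city balanced sampling for a representative dataset.
    by_city = defaultdict(list)
    for c in usable:
        by_city[c["city"]].append(c)
    balanced = [c for city_clips in by_city.values()
                for c in random.sample(city_clips, min(per_city_quota, len(city_clips)))]

    # 3. Annotate concise rationales and final answers (annotation backend omitted here).
    annotated = [dict(c, rationale="...", answer="...") for c in balanced]

    # 4. Schema validation for alerts, reports, and summaries.
    valid = [c for c in annotated if {"rationale", "answer"} <= c.keys()]

    # 5. Hard-negative mining would further refine the pool (omitted in this sketch).
    return valid
```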
SFT shards (same pool)
- Visual/domain coverage: flows (merge/lane/ped), markings, condition diversity (time-of-day, weather, road state).
- Contract focus: instruction-oriented prompts to stabilize output formats.
Privacy
Brighter AI anonymization, full data lineage, and GDPR/CCPA compliance; EU data stays in the EU.
Figure 2: Hafnia & NVIDIA Cosmos curation pipelines
Table 2: Breakdown of curated training dataset
Chosen approach: two-step SFT
We applied visual/domain post-training and output-format adaptation as two separate passes. This reduced complexity, kept learning stable, and produced consistent formatting while improving flow reasoning.
Using the Cosmos Reason SFT code, we applied different configurations, varying visual resolution, temporal resolution, or both, until we found the sweet spot for each domain adaptation.
The SFT dataset was pulled together in different shapes and forms, from general question-and-answer (summary) formats to task-specific prompt/answer pairs, plus instruction datasets to reinforce the output formatting.
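To illustrate what those "shapes" might look like, here are three hypothetical training samples. The file paths, field names, and answers are made up for illustration and do not reflect the exact Hafnia schema.

```python
# Illustrative SFT samples in three shapes; all values are placeholders.
summary_sample = {
    "clip": "clips/us/cam_042_0001.mp4",  # hypothetical path
    "prompt": "Summarize the traffic scene in this clip.",
    "answer": "Moderate traffic merging left; wet road surface; dusk lighting.",
}

task_sample = {
    "clip": "clips/eu/cam_117_0093.mp4",  # hypothetical path
    "prompt": "Is the upstream 'stalled vehicle' alert valid? Think briefly, then answer in JSON.",
    "answer": '{"alert_valid": true, "notes": "Vehicle stationary in lane 2 for over 30 s."}',
}

instruction_sample = {
    "clip": "clips/us/cam_042_0002.mp4",  # hypothetical path
    "prompt": "Output only the JSON schema; no extra text.",
    "answer": '{"camera_id": "cam_042", "events": []}',
}
```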
Example of a representative configuration (a config sketch follows the list):
- Ingest: ~30 s clips, 2 fps (training).
- Context: model_max_length=12288.
- Optimizer: AdamW.
- Batching: train_batch_per_replica=6, mini_batch=1, 1 epoch.
- Dtypes/parallelism: master=float32, param=bfloat16, FSDP defaults (no offload).
- Output schema targets: AlertValidation, MonitoringSummary, IncidentReport.
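For reference, the settings above map onto a training configuration roughly like the sketch below. The values mirror the list; the exact structure and key names of the Cosmos Reason SFT config may differ.

```python
# Representative SFT configuration (sketch; exact config layout may differ).
sft_config = {
    "data": {
        "clip_length_s": 30,   # ~30 s clips
        "fps": 2,              # 2 fps sampling during training
    },
    "model_max_length": 12288,
    "optimizer": "AdamW",
    "train_batch_per_replica": 6,
    "mini_batch": 1,
    "epochs": 1,
    "master_dtype": "float32",
    "param_dtype": "bfloat16",
    "parallelism": "FSDP defaults, no offload",
    "output_schemas": ["AlertValidation", "MonitoringSummary", "IncidentReport"],
}
```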
Infrastructure & stability
We ran full supervised fine-tuning on 8×H100 GPUs, with ~50–60 experiments per model and 7–8 seconds per step in the training loop.
Figure 3: Hafnia & NVIDIA post-training and testing flow
Next, we adjusted the configuration of the different annotation shards, that is, the combinations of prompts and answers/captions included with the clips across the different tuning experiments.
The goal was to make outputs machine-consumable for operations systems (alert manager, reporting) while preserving concise reasoning. We standardized prompts and enforced format-bound answers with frame-aligned timestamps.
Prompt patterns(*):
- System: “You are a CCTV traffic analyst. Validate upstream alerts. Think briefly, then output in JSON format.”
- Alert validation (parent–child)
 - Answer JSON fields: camera_id, time_range, conditions{weather, light, visibility, road_state}, events[], notes
- Scheduled monitoring (hourly/daily/custom)
- Report automation (incident drafts)
 - Answer JSON fields: incident_type, timeline{start, end, key_moments[]}, parties[], location_hint, cause_hypothesis, evidence_frames[]
(*) In these examples we simplify to JSON; in practice the exact output format was defined in the prompts.
Output schemas:
- AlertValidation v1
- MonitoringSummary v1
- IncidentReport v1
- Prompting: always require “the specific schema only; no extra text”; keep <think> brief to avoid style overfitting. As noted above, this is a simplification; the actual format was defined in the prompts.
- Validators in loop: format parsing, required-field checks, enumeration validation, and timestamp alignment (±N frames); a minimal validator sketch follows this list.
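Here is a minimal sketch of such an in-loop validator, assuming a JSON AlertValidation output. The required fields come from the schema above; the weather enumeration, the frame tolerance N, and the function name are illustrative placeholders.

```python
import json

REQUIRED_FIELDS = {"camera_id", "time_range", "conditions", "events", "notes"}
WEATHER_ENUM = {"clear", "rain", "snow", "fog"}  # illustrative enumeration
TIMESTAMP_TOLERANCE_FRAMES = 2                   # "±N frames"; N is a placeholder

def validate_alert_validation(raw_output: str,
                              alert_time_s: float | None = None,
                              fps: float = 2.0) -> list[str]:
    """Return a list of validation errors; an empty list means the sample passes."""
    # 1. Format parsing: the answer must be valid JSON with no extra text.
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    errors = []

    # 2. Required-field checks.
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")

    # 3. Enumeration validation (weather shown as an example).
    conditions = payload.get("conditions")
    if isinstance(conditions, dict):
        weather = conditions.get("weather")
        if weather is not None and weather not in WEATHER_ENUM:
            errors.append(f"unknown weather value: {weather!r}")

    # 4. Timestamp alignment: the reported start should sit within ±N frames
    #    of the upstream alert's timestamp.
    time_range = payload.get("time_range")
    if alert_time_s is not None and isinstance(time_range, dict):
        start_s = time_range.get("start")
        if isinstance(start_s, (int, float)):
            drift_frames = abs(start_s - alert_time_s) * fps
            if drift_frames > TIMESTAMP_TOLERANCE_FRAMES:
                errors.append(f"start {start_s}s is {drift_frames:.1f} frames from the alert")

    return errors
```

In a setup like this, samples that fail any check could be re-annotated or dropped before the next tuning pass.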
Operator terminology. Normalize entities (e.g., stalled_vehicle, merge_left, crosswalk_occupied) and map to specific taxonomies.
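As a trivial illustration of that normalization step, assuming a flat synonym map (the phrases on the left are examples, not the full taxonomy):

```python
# Map free-form operator phrasing to canonical taxonomy entities (illustrative subset).
ENTITY_TAXONOMY = {
    "stopped car": "stalled_vehicle",
    "broken down vehicle": "stalled_vehicle",
    "merging left": "merge_left",
    "pedestrian in crosswalk": "crosswalk_occupied",
}

def normalize_entity(term: str) -> str:
    """Return the canonical entity name, or a cleaned fallback if the term is unknown."""
    key = term.strip().lower()
    return ENTITY_TAXONOMY.get(key, key.replace(" ", "_"))
```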
This approach worked: it delivered simple output formats and concise rationales, yielding parseable, consistent outputs without sacrificing reasoning quality.
We evaluated both the baseline NVIDIA Cosmos Reason and our SFT model on a representative sample of smart city cameras, including diverse scenes and challenging edge cases common in this context (dusk glare, heavy rain, occlusions, complex merges). As mentioned before, the baseline model is already a top performer on the “Physical Reasoning from Video” leaderboard on Hugging Face, so the goal was to test whether domain-specific fine-tuning could push performance beyond an already strong baseline, or show no material gain.
Here’s a summary of our post-training results:
- Impressive baseline performance with no domain supervision. Not surprisingly, Cosmos Reason’s understanding was strong out of the box, which is particularly notable given the influence of ego-view (AV/robotics) applications on its training.
- Largest gains come from visual domain transfer. The +19.4% jump in traffic flow/direction understanding and +8.9% in condition identification indicate that post-training on city-specific visuals (within-city balance across scenes, lighting, and road states) substantially improves visual understanding from static viewpoints.
- Strong functional improvements. Alert-verification accuracy rose +4.4%, which supports real-world alert triage and a reduction in false positives. Formatting already had high baseline performance, and the fine-tuned model boosted validity, completeness, and ordering by ~+0.6% on average. These improvements have a meaningful impact on downstream automation tasks that depend on specific schemas.
In summary, beginning with a robust Physical-AI VLM and applying two-step SFT (first for general visual adaptation, then for output-focused objectives) significantly improves CCTV tasks such as traffic analysis, condition assessment, and clean schema generation.
Conclusion
With the two-step SFT process, the specialized US VLM and EU VLM reliably handle traffic flows, condition classification, and output-format alignment for operations.
Next, we plan to continually refresh the training content with Curator, improve model performance, add guardrails and Q&A chat windows, and explore model distillation for edge deployments.
Integrate this new VLM with XProtect or apply it to your own feeds; contact us for Hafnia VLM-as-a-Service access.
We are inviting 100 developers and companies to participate in the Hafnia Hackathon 2025, in collaboration with Milestone Systems and NVIDIA, to build integrations with other applications using Hafnia’s vision language models, from real-time analytics to smart city solutions and beyond.
Finalists’ submissions will be presented at the Milestone Dev Summit in Copenhagen (Nov 10–11), where the winners will be announced.
Find out more about the hackathon and its prizes here: hafnia.milestonesys.com/hackathon
Authors: Fulgencio Navarro (FUNA), Danilo Dresen (DAND), Shriram Arunachalam Muthuvezhappan (SARU) and Edward Mauser (EDM)