Insight

From Single Images to 4D Gaussian Splatting

ML-Sharp turns 2D images into 3D Gaussian Splats; run it on video frames to get 4DGS, and compare with KIRI Engine’s multi-view 3DGS.

onehuang · Jan 23, 2026

TL;DR

ML-Sharp is an open-source tool that converts a single 2D image into a 3D Gaussian Splat. When applied frame-by-frame to video, it can be used to create experimental 4D Gaussian Splatting pipelines. This article explains how that works, how it differs from KIRI Engine’s multi-view 3DGS, and how to prepare video data for this workflow.

Why ML-Sharp Matters

Gaussian Splatting has become one of the most important representations in modern 3D capture, offering high-quality, view-dependent rendering without the overhead of traditional meshes.

ML-Sharp, an open-source project released by Apple, offers a remarkably simple entry point into this world:

Results from SHARP: a single 2D image is converted into a 3D Gaussian Splat and rendered from novel viewpoints, demonstrating monocular 3D Gaussian view synthesis.

This makes ML-Sharp especially useful for experimentation, research, and creative exploration—particularly when extended to video.

What ML-Sharp Actually Produces

ML-Sharp does not output a mesh, point cloud, or voxel grid.

Instead, it produces:

  • A Gaussian Splat (.ply)

  • Each splat encodes position, scale, orientation, and color

  • The result behaves like a navigable “3D photo”
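To make that format concrete, here is a minimal Python sketch that inspects a splat .ply with the plyfile package (pip install plyfile). It assumes the common 3DGS vertex layout (x/y/z positions plus scale, rotation, opacity, and color attributes); the filename is hypothetical, and you should check your file's header if the attribute names differ.

```python
# Peek inside a Gaussian Splat .ply with plyfile.
# Assumes the widely used 3DGS vertex layout; verify against your header.
from plyfile import PlyData

ply = PlyData.read("output.ply")   # hypothetical ML-Sharp output file
splats = ply["vertex"]

print(f"{splats.count} Gaussians in the file")
print("attributes:", [p.name for p in splats.properties])

# Per-splat parameters, one array entry per Gaussian:
x, y, z = splats["x"], splats["y"], splats["z"]   # position
print("first splat position:", x[0], y[0], z[0])
```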

The following clip shows how ML-Sharp infers a 3D Gaussian field from a single image and renders a new camera view, compared directly against the original photo, as tested by the KIRI Engine team:

This video compares ML-Sharp’s monocular 3D Gaussian Splat view with the original photo, showing how single-image 3D inference approximates depth and camera movement.

These splats can be viewed and explored immediately in compatible Gaussian Splat viewers such as SuperSplat.

How to Run ML-Sharp

If you would like to try ML-Sharp yourself, you don’t need to piece together instructions from GitHub threads or scattered posts.

Our friends at Radiance Fields have already published a clear, step-by-step walkthrough that covers:

  • Installing ML-Sharp

  • Running it locally on Apple Silicon

  • Processing single images

  • Viewing the resulting Gaussian Splats

Apple ML-Sharp tutorial thumbnail showing how to install and run Sharp to generate 3D Gaussian Splats from a single image on Apple Silicon.

👉 Follow the Radiance Fields ML-Sharp tutorial here

We link to this guide so readers who want to run ML-Sharp locally have a clear, reliable reference, while this article stays focused on how ML-Sharp fits into modern 3D and 4D Gaussian Splatting workflows.

ML-Sharp 3DGS vs KIRI Engine 3DGS

Although both ML-Sharp and KIRI Engine output Gaussian Splats, they are fundamentally different in how those splats are created.

ML-Sharp operates on a single monocular image.

All depth and structure are inferred by AI, based on learned visual cues such as perspective, shading, and occlusion.

KIRI Engine’s 3D Gaussian Splatting is built from real multi-view geometry:

  • Dozens or hundreds of images

  • Solved camera poses

  • Optional LiDAR depth

  • Photogrammetric and multi-view optimization

The following clip shows a real-world 3D Gaussian Splat created by a KIRI Engine user, demonstrating how multi-view 3DGS can be deployed in physical exhibition spaces.

This video shows a production-quality 3D Gaussian Splat of an Alfa Romeo Formula 1 car created by a KIRI Engine user and displayed in a real exhibition environment.

This means every Gaussian in a KIRI Engine 3DGS is constrained by real physical measurements, not just neural inference.

This comparison highlights why monocular Gaussian splatting is visually impressive but cannot match the geometric stability of true multi-view 3DGS:

Comparison table between ML-Sharp monocular 3D Gaussian Splatting and KIRI Engine multi-view 3DGS, showing differences in depth, camera geometry, and accuracy.

ML-Sharp answers:

“What might this scene look like in 3D?”

KIRI Engine answers:

“Where is everything in 3D space?”

The following clip demonstrates KIRI Engine’s end-to-end 3D Gaussian Splatting pipeline, from real-world mobile capture to live 3DGS preview. Unlike monocular AI inference, this 3D Gaussian Splat is reconstructed from real multi-view camera data, giving it correct scale, stable geometry, and consistent parallax.

This video shows KIRI Engine capturing a real person with a phone (right) and generating a live 3D Gaussian Splat preview of the same scene (left).

This distinction becomes even more important when we introduce time.

From 3D to 4D: Using ML-Sharp on Video

ML-Sharp becomes especially powerful when applied to video using a frame-based pipeline:

  1. Split a video into frames

  2. Run ML-Sharp on each frame → infer a 3D Gaussian field

  3. Align and stack those per-frame splats over time

The following clip shows how ML-Sharp converts a single video into a frame-based 4D Gaussian Splat by inferring a 3D Gaussian field for each frame and stacking them over time.

This video compares the original video (top) with the 4D Gaussian Splat generated by ML-Sharp from sequential frames (bottom), showing monocular 4DGS in action.

This transforms flat video into a temporal sequence of 3D representations, an experimental form of 4D Gaussian Splatting (3D + time).

Before running ML-Sharp, the critical first step is converting your video into individual image frames.

How to Split a Video into Frames

There are three practical ways to prepare video input for ML-Sharp.

Using Video Editing Software

Tools like Final Cut Pro, Premiere Pro, or DaVinci Resolve allow you to:

  • Import a video

  • Export it as an image sequence

  • Choose a frame rate (e.g., 24 fps, 30 fps)

    Exporting a video as an image sequence from Final Cut Pro, the first step before running ML-Sharp for frame-based 4D Gaussian Splatting.

This is the most precise and professional option.

Using Online Video-to-Frame Tools

Many web services like ezgif let you:

  • Upload a video

  • Download a ZIP of extracted frames

  • No installation required

    Ezgif video-to-image-sequence tool used to extract frames from a video for ML-Sharp and 4D Gaussian Splatting.

This is ideal for quick tests and lightweight experiments.

Using Your Phone (Burst Mode)

You can bypass video entirely:

This video shows how a smartphone’s burst mode is used to capture a sequence of images that can be processed by ML-Sharp for frame-based 4D Gaussian Splatting.

This creates a pseudo-video sequence that works surprisingly well for frame-based 4DGS experiments.
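If you prefer a scriptable route over the three options above, a short Python sketch using OpenCV (pip install opencv-python) does the same job. The paths and the subsampling step below are illustrative, not fixed requirements.

```python
# Minimal frame-extraction sketch using OpenCV.
import cv2
from pathlib import Path

video_path = "input.mp4"   # hypothetical input file
out_dir = Path("frames")
out_dir.mkdir(exist_ok=True)

cap = cv2.VideoCapture(video_path)
frame_idx = saved = 0
step = 1   # save every frame; raise to subsample (e.g., 2 = every other frame)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        # Zero-padded names keep the sequence sorted for later stacking.
        cv2.imwrite(str(out_dir / f"frame_{saved:05d}.png"), frame)
        saved += 1
    frame_idx += 1

cap.release()
print(f"Saved {saved} frames to {out_dir}")
```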

Running ML-Sharp on Video Frames (4DGS Core Pipeline)

Once a video has been split into frames, each frame can be processed independently by ML-Sharp.

At a high level, the 4D Gaussian Splatting workflow consists of two steps:

  • Run ML-Sharp (open-source) on each frame → infer a 3D Gaussian field

  • Align and stack those per-frame 3DGS results over time

The following clip demonstrates how frame-by-frame 3D Gaussian Splats are organized, aligned, and stacked into a time-aware 4D Gaussian Splat:

This video shows how individual ML-Sharp 3D Gaussian Splats (PLY files) generated from video frames are stacked and aligned over time to form a 4D Gaussian Splat.

The first step turns a flat image sequence into a set of time-indexed 3D Gaussian snapshots.
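A batch driver for that first step might look like the sketch below. The exact ML-Sharp command and flags are not reproduced here; the invocation is a placeholder that you would replace with the real one from the ML-Sharp README or the Radiance Fields tutorial.

```python
# Run ML-Sharp once per extracted frame, producing one .ply per time step.
import subprocess
from pathlib import Path

frames = sorted(Path("frames").glob("frame_*.png"))
out_dir = Path("splats")
out_dir.mkdir(exist_ok=True)

for frame in frames:
    out_ply = out_dir / (frame.stem + ".ply")
    # PLACEHOLDER command -- substitute the actual ML-Sharp CLI and flags.
    cmd = ["python", "-m", "sharp", str(frame), str(out_ply)]
    subprocess.run(cmd, check=True)   # one 3D Gaussian snapshot per frame
```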

The second step brings them into a shared temporal structure, transforming isolated 3D splats into a coherent 4D representation (3D + time).
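For the second step, here is a deliberately naive stacking sketch: it loads each per-frame .ply with plyfile and re-centers every frame on the first frame's centroid as a crude shared origin. Real temporal alignment would need per-frame camera poses, which monocular inference does not provide, so treat this purely as an illustration of the time-indexed structure.

```python
# Naive temporal "stacking" of per-frame splats into a time-indexed list.
import numpy as np
from pathlib import Path
from plyfile import PlyData

splat_files = sorted(Path("splats").glob("frame_*.ply"))

sequence = []   # list of (time index, per-splat positions): the "4D" stack
anchor = None   # centroid of frame 0, used as a crude shared origin

for t, path in enumerate(splat_files):
    v = PlyData.read(str(path))["vertex"]
    xyz = np.column_stack([v["x"], v["y"], v["z"]]).astype(np.float32)
    if anchor is None:
        anchor = xyz.mean(axis=0)
    xyz -= xyz.mean(axis=0) - anchor   # centroid alignment, not true pose alignment
    sequence.append((t, xyz))

print(f"Stacked {len(sequence)} time-indexed splat frames")
```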

This is the simplest way to extend single-image Gaussian inference into video-based 4DGS.

The result is a full 4D Gaussian Splat, a time-aware 3D representation reconstructed purely from a single monocular video:

This frame shows the final 4D Gaussian Splat generated by ML-Sharp from a single handheld video, capturing both scene geometry and motion over time.

When to Use ML-Sharp vs When to Use KIRI Engine

Both tools work with Gaussian Splats, but they are designed for very different goals.

Use ML-Sharp when you want:

  • Rapid experimentation

  • Support for legacy photos or videos

  • Lightweight 4DGS prototyping

  • Learning how Gaussian Splats behave over time

Use KIRI Engine when you need:

  • True multi-view 3D accuracy

  • Stable geometry and occlusion

  • LiDAR-enhanced depth

  • Production-ready assets for XR, games, VFX, and digital twins

A simple rule of thumb

If you ask: “What might this look like in 3D?” → ML-Sharp

If you ask: “Where is everything in 3D space?” → KIRI Engine

Final Thoughts & A Small Preview

ML-Sharp shows how far monocular AI inference has progressed—but also clearly reveals its limits.

As a free, open-source tool, it is ideal for:

  • Learning Gaussian Splat fundamentals

  • Rapid experimentation

  • Exploring 4D Gaussian Splatting from video

And as a comparison point, it helps explain why true multi-view 3DGS and LiDAR-based capture—like those in KIRI Engine—are still essential for any serious 3D production pipeline.

And here’s the small preview 👀

We’ve been experimenting internally with ML-Sharp on sequential frames, including filtering and segmentation, and the early results look very encouraging.

This video shows a frame-based 4D Gaussian Splat generated from Apple’s ML-Sharp and further optimized by the KIRI Engine team using filtering and segmentation.