Claude Code · 2026-05-15 · unknown branch
Session Summary: Investigate feed-forward vs per-scene reconstruction open question
Main outcome: Investigated the inference-cost-vs-fidelity open question. Read the existing question doc, enumerated the LRM-family and SDS-on-splats families, and used `ask` for a synthesis pass. The researcher then pointed out that Instant3D had been dropped without reading, so the agent re-fetched it and re-ranked the feed-forward family, discussing whether the diffusion sampler counts as per-scene optimisation. The researcher then demanded a single-architecture ablation rather than a cross-paper benchmark; the agent designed a concrete LGM-as-warm-start ablation with three arms (frozen feed-forward, warm-start + N-step fine-tune sweep, per-scene from scratch) on a fixed 25-scene set.
Key decisions
None recorded.
Files touched
None recorded.
Commands run
None recorded.
Tools used
- ask
- lint_report
- node_context
- raw_source
- search_nodes
- wiki_page
Turn-by-turn conversation
Redacted user/assistant transcript turns, with assistant tool use collapsed under its response.
We have an open question about feed-forward vs per-scene reconstruction. What does the corpus say?
thinking: Read the existing question doc, then enumerate the LRM-family papers, then use the ask tool for a synthesis pass.
Let me read the open-question doc first, then enumerate the feed-forward family, then run a synthesis question through ask.
Tool use (8)
json{
"source_path": "data/research/questions/inference-cost-vs-fidelity.md"
}json{
"limit": 8,
"query": "Large Reconstruction Model feed-forward",
"type": "Paper"
}json{
"node_id": "Paper:arxiv-2311-04400"
}json{
"node_id": "Paper:arxiv-2402-05054"
}json{
"node_id": "Paper:arxiv-2403-02151"
}json{
"limit": 5,
"query": "DreamGaussian SDS on splats",
"type": "Paper"
}json{
"backend": "auto",
"question": "On the same input image, how does feed-forward 3D reconstruction (LRM/LGM/TripoSR) compare to per-scene SDS-on-splats (DreamGaussian) in fidelity, geometric accuracy, and wall-clock cost?",
"top_k": 8
}json{}
thinking: No lint report yet. Move on. The answer plus the existing question doc is enough to propose an experiment design.
Lint report doesn't exist yet for this build. Here is what I have:
What the corpus already says
- The open question doc (questions/inference-cost-vs-fidelity.md) calls out the four-axis gap: fidelity / geometric accuracy / wall-clock / GPU-hour amortised cost. No paper provides all four.
- The feed-forward family is LRM (arxiv-2311-04400) -> Instant3D (arxiv-2311-06214) -> LGM (arxiv-2402-05054) -> TripoSR (arxiv-2403-02151). All three I read fit in ~0.5-5s/asset.
- The per-scene SDS-on-splats family is DreamGaussian (arxiv-2309-16653) and AGG (arxiv-2401-04099, which is itself an amortised hybrid). DreamGaussian is ~2 minutes/asset.
New evidence the ask synthesis pulled together
- Feed-forward methods are bandwidth-limited on thin structures because Objaverse under-represents them.
- Per-scene SDS hallucinates back-side geometry where the critic has weak supervision.
- The amortised-cost break-even sits at roughly 10k assets: above that volume feed-forward wins; below it, DreamGaussian wins.
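The break-even claim in the last bullet can be sketched as simple arithmetic. This is a minimal illustration, not a measurement: the function names are hypothetical, the per-asset times (~5 s feed-forward, ~2 min DreamGaussian) are the rough figures quoted in this session, and the training budget is a placeholder chosen only so that the result lands near the 10k figure. It also assumes the per-scene method has no training cost attributable to it.

```python
import math


def amortised_seconds_per_asset(train_gpu_seconds, infer_seconds, n_assets):
    """One-off training cost spread over n_assets, plus per-asset inference."""
    if n_assets <= 0:
        raise ValueError("n_assets must be positive")
    return train_gpu_seconds / n_assets + infer_seconds


def break_even_assets(train_gpu_seconds, ff_infer_seconds, per_scene_seconds):
    """Smallest asset count at which feed-forward costs no more per asset.

    Returns None when feed-forward inference is not cheaper than per-scene
    optimisation, in which case training can never be amortised away.
    """
    saving_per_asset = per_scene_seconds - ff_infer_seconds
    if saving_per_asset <= 0:
        return None
    return math.ceil(train_gpu_seconds / saving_per_asset)


# PLACEHOLDER training budget (GPU-seconds); not taken from any paper.
HYPOTHETICAL_TRAIN_GPU_SECONDS = 1_150_000
n_star = break_even_assets(HYPOTHETICAL_TRAIN_GPU_SECONDS, 5.0, 120.0)
```

With a real training budget substituted for the placeholder, `n_star` gives the asset volume at which the two families cost the same per asset; the 100 / 1k / 10k / 100k volumes in metric (d) would then bracket it.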
Proposed experiment to resolve the question (extending the existing doc):
- Inputs: 100 single images sampled stratified across (synthetic / GSO / in-the-wild) and (thin / blocky / organic) categories.
- Methods: LRM, LGM, TripoSR (feed-forward) and DreamGaussian, AGG (per-scene/amortised).
- Metric (a): Held-out-view PSNR / SSIM / LPIPS on the 50 GSO captures (ground truth available).
- Metric (b): Chamfer + F-score against GSO meshes.
- Metric (c): Wall-clock inference on a single A100, single asset.
- Metric (d): Amortised GPU-hours per asset across 100, 1k, 10k, 100k asset volumes.
- Stratify (a)+(b) by category to expose the thin-structure failure mode the synthesis predicts.
Even if one of {a, b, c, d} is missing, the comparison can still be made (this corpus has roughly enough data to estimate three out of four). I'd extend the open-question doc with this design rather than rewrite it.
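For reference, metrics (a) and (b) could be computed along these lines. This is a dependency-free sketch under stated assumptions: images are flat sequences of intensities in [0, 1], point clouds are small lists of 3D tuples, and the Chamfer variant is the symmetric mean of squared nearest-neighbour distances (conventions differ between papers, so the variant would need to be fixed in the doc). A real evaluation would use an array library and a spatial index rather than these brute-force loops.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance using squared Euclidean distances.

    Brute force, O(len(a) * len(b)); fine for a sketch, too slow for
    dense mesh samples.
    """
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```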
You searched for the LRM family and the search returned Instant3D too, but you never opened it. Pull it in. Where does it actually sit on the spectrum — is it really feed-forward, or is the diffusion stage hiding a per-scene optimisation in disguise?
thinking: Right — I have Instant3D in the search results from turn 5 and silently dropped it because I assumed it was a text-to-3D paper not a single-image reconstructor. But the user's question is sharper: Instant3D's pipeline is multi-view-diffusion -> LRM-style reconstructor. The diffusion stage is a forward sampler; the reconstructor is feed-forward. So end-to-end it's two feed-forward passes but the diffusion sampler internally iterates. Whether that counts as 'feed-forward' depends on whether you count diffusion sampling as optimisation. Let me read the body.
Fair catch — I dropped it on a wrong assumption. Reading it now.
Tool use (2)
json{
"node_id": "Paper:arxiv-2311-06214"
}json{
"limit": 12,
"node_id": "Paper:arxiv-2311-06214"
}
thinking: Instant3D belongs in the feed-forward column unambiguously. The 25-step DDIM is a sampler not per-scene optimisation against a photometric loss. LGM actually extends Instant3D (uses its 4-view diffusion output as input). Re-ranking: LRM (single-image -> triplane) -> Instant3D (text -> 4-view diffusion -> LRM-style triplane) -> LGM (uses Instant3D-style 4-view -> Gaussian field). All three are feed-forward. TripoSR is single-image -> triplane, same lineage as LRM. AGG sits between the two camps as an amortised hybrid.
Re-ranked family:
Pure feed-forward (single forward pass over network, no photometric loss at inference time):
- LRM (arxiv-2311-04400) - single image -> triplane in ~5s.
- TripoSR (arxiv-2403-02151) - LRM at production quality, ~0.5s.
- Instant3D (arxiv-2311-06214) - text -> 4-view diffusion -> LRM-style triplane, ~20s end-to-end. The diffusion sampler IS iterative (25 DDIM steps) but it's not optimising a scene-specific loss; it's sampling a generative prior. That's the distinction.
- LGM (arxiv-2402-05054) - uses Instant3D's 4-view output as input, emits a Gaussian field at higher resolution in ~5s.
Amortised hybrid:
- AGG (arxiv-2401-04099) - amortised single-image Gaussian generation; sits between feed-forward (training-time only) and per-scene (lightweight fine-tune at inference).
Per-scene SDS-on-splats:
- DreamGaussian (arxiv-2309-16653) - SDS loop on 3D Gaussians, ~2 min/asset.
Your distinction holds up: Instant3D's diffusion stage is iterative but is not per-scene optimisation, because the iteration is over a fixed generative prior, not a scene-specific photometric residual. The cost picture does shift, though: if you have to re-run the diffusion sampler for every prompt, the amortised per-asset cost (~20s end-to-end) is closer to per-scene than the LRM-only numbers suggest.
OK. Now design the actual experiment that would answer this question. Not 'benchmark the five papers against each other' — they all use different training sets. I want a single ablation where the only changed variable is feed-forward vs per-scene on the same model class. What does that look like?
thinking: The cleanest ablation isolates the training procedure, not the architecture. LRM and DreamGaussian have different architectures (transformer triplane vs Gaussian-soup SDS), different priors, and different training sets, so comparing them measures everything but the variable in question. The real ablation is: take ONE architecture (say, a Gaussian-field decoder) and run it three ways on the same scene set: (a) pure feed-forward with frozen weights, (b) feed-forward warm-start followed by N iterations of per-scene fine-tuning on the input-view photometric loss, (c) per-scene from scratch (random init or SfM-init Gaussian soup). Sweep N. The crossover point is the answer.
Single-architecture ablation. Anything else conflates the training distribution, the architecture, and the procedure.
Tool use (1)
json{
"source_path": "data/research/questions/inference-cost-vs-fidelity.md"
}
thinking: Concrete proposal. Use LGM as the fixed architecture (it's a Gaussian-field decoder, has public weights, and accepts 4-view input which is the natural input format). Three arms: (A) frozen LGM forward pass only, (B) LGM forward pass + N iterations of per-scene photometric optimisation of the predicted Gaussians, sweeping N in {0, 50, 200, 1000, 5000} steps, (C) per-scene from SfM-init Gaussian soup with no LGM warm-start, run to convergence (~30k steps for 3DGS recipe). Same test scenes (say 20 GSO captures with held-out views + Mip-NeRF 360 indoor for unbounded). Measure held-out PSNR/SSIM/LPIPS as a function of N. Plot the curve. The N at which arm B catches arm C tells you the warm-start value of feed-forward; the asymptote of B vs the value of A tells you how much per-scene fine-tuning adds; the slope between is the cost-fidelity exchange rate.
Concrete design — extending the open-question doc:
Fixed architecture. Use LGM (arxiv-2402-05054) — Gaussian-field decoder, public weights, accepts 4-view input. Same network in all three arms.
Same scene set. 20 GSO captures (ground-truth mesh available for Chamfer) + 5 unbounded indoor scenes from Mip-NeRF 360. ~25 scenes. All evaluated on the same held-out viewpoints.
Three arms, only the procedure varies:
- Arm A: pure feed-forward. Run LGM once, freeze, render held-out views.
- Arm B: warm-start + per-scene fine-tune. Run LGM once to get initial Gaussian set, then optimise photometric loss on the input views for N iterations. Sweep N in {0, 50, 200, 1000, 5000}. The N=0 setting recovers Arm A.
- Arm C: per-scene from scratch. Discard LGM. SfM-seed a Gaussian soup as in original 3DGS. Optimise to convergence (~30k steps).
Single metric of interest. Held-out-view PSNR as a function of total wall-clock seconds (arm A: ~5s, arm B: 5s + N*~0.04s, arm C: ~40 min). Plot the Pareto frontier.
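The wall-clock axis of that plot follows directly from the estimates just quoted. A small sketch, assuming the rough figures above hold (~5 s per LGM forward pass, ~0.04 s per fine-tune step, ~40 min for Arm C); the constant and function names are illustrative only, and real timings would be measured on the target GPU.

```python
# Rough per-arm timing estimates taken from the design above (not measured).
LGM_FORWARD_SECONDS = 5.0
FINE_TUNE_STEP_SECONDS = 0.04
ARM_C_SECONDS = 40 * 60
SWEEP_STEPS = (0, 50, 200, 1000, 5000)


def arm_b_seconds(n_steps):
    """Total wall-clock for Arm B: one forward pass plus n_steps of fine-tuning."""
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return LGM_FORWARD_SECONDS + n_steps * FINE_TUNE_STEP_SECONDS


# x-coordinates for the Arm B points on the Pareto plot, keyed by N.
arm_b_budget = {n: arm_b_seconds(n) for n in SWEEP_STEPS}

# Fraction of Arm C's budget consumed by the longest Arm B run.
longest_b_fraction = arm_b_budget[max(SWEEP_STEPS)] / ARM_C_SECONDS
```

Under these estimates the sweep spans roughly 5, 7, 13, 45, and 205 seconds, so even the N=5000 point uses under a tenth of Arm C's budget.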
What the curve answers.
- The N at which Arm B's PSNR equals Arm C's converged PSNR = the warm-start value of feed-forward, in 'iterations saved'.
- The gap between Arm A and Arm C at convergence = the irreducible feed-forward fidelity penalty for this architecture class.
- The slope of Arm B between N=200 and N=1000 = the marginal cost-fidelity exchange rate at the working point.
- Stratify by scene type (object-centric GSO vs unbounded Mip-NeRF 360) to expose the in-distribution / out-of-distribution gap LRM-family methods are known to have.
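The three quantities read off the curve could be extracted from the sweep results roughly as follows. This is a sketch with hypothetical helper names; `curve` maps each swept N to Arm B's mean held-out PSNR, and the numbers in the usage lines are synthetic placeholders for illustration, not results from any run.

```python
def crossover_n(curve, arm_c_psnr):
    """Smallest swept N at which Arm B reaches Arm C's converged PSNR, else None."""
    for n, psnr_db in sorted(curve.items()):
        if psnr_db >= arm_c_psnr:
            return n
    return None


def feedforward_penalty(curve, arm_c_psnr):
    """Gap in dB between converged Arm C and frozen feed-forward (Arm B at N=0)."""
    return arm_c_psnr - curve[0]


def marginal_slope(curve, n_lo, n_hi, step_seconds=0.04):
    """dB of PSNR gained per extra wall-clock second between two sweep points."""
    if n_hi <= n_lo:
        raise ValueError("n_hi must be greater than n_lo")
    return (curve[n_hi] - curve[n_lo]) / ((n_hi - n_lo) * step_seconds)


# SYNTHETIC placeholder values, for illustration only.
example_curve = {0: 24.0, 50: 25.0, 200: 26.5, 1000: 28.5, 5000: 29.0}
example_arm_c_psnr = 28.5

n_cross = crossover_n(example_curve, example_arm_c_psnr)
penalty_db = feedforward_penalty(example_curve, example_arm_c_psnr)
slope_db_per_s = marginal_slope(example_curve, 200, 1000)
```

Running the same three functions separately on the GSO and Mip-NeRF 360 subsets would give the stratified comparison in the last bullet. Because the sweep is coarse, `crossover_n` only brackets the true crossover to the nearest swept value.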
What this doesn't answer. Cross-architecture comparison (LGM vs LRM vs TripoSR vs DreamGaussian). The ablation deliberately fixes the architecture to isolate the training-procedure variable. The cross-architecture question is a separate experiment.
I'd add this as a 'proposed resolution' section to questions/inference-cost-vs-fidelity.md.
Subagent sessions
No subagent transcripts attached.