WebBrass / Tuba Proto Prototype Core
This prototype translates the organology of brass instruments into a web interface structured around three axes:
Lip + hand complex as interface
Harmonic series + valves as pitch engine
Brassness as a live timbral metric
The scene is designed with the lips centered and cropped, over a sensing grid (pressure, position, emission) and a visual cone along the Z-axis representing the virtual horn uncoiled.
I. Camera Space → Mask Space Mapping
The object-fit: cover Problem
MediaPipe FaceMesh returns landmarks in normalized coordinates $[0,1]^2$ relative to the native video stream frame.
The <video> element, however, is rendered with object-fit: cover, which scales the stream to fill the element and crops the overflowing edges symmetrically. This introduces a non-trivial mismatch between stream space and viewport space.
Let:
- $(v_w, v_h)$ = native stream resolution in pixels
- $(e_w, e_h)$ = video element size (= viewport size)
The cover scale factor is:
$$s_\text{cover} = \max\left(\frac{e_w}{v_w},\ \frac{e_h}{v_h}\right)$$
The cropped pixels per side are:
$$c_x = \frac{v_w - e_w / s_\text{cover}}{2}, \quad c_y = \frac{v_h - e_h / s_\text{cover}}{2}$$
Normalized crop fractions:
$$f_x = \frac{c_x}{v_w} = \frac{v_w - e_w/s_\text{cover}}{2v_w}, \quad f_y = \frac{c_y}{v_h}$$
The visible stream range in normalized coordinates is:
$$[f_x,\ 1-f_x] \times [f_y,\ 1-f_y]$$
with visible width and height:
$$w_\text{vis} = 1 - 2f_x, \quad h_\text{vis} = 1 - 2f_y$$
Given a MediaPipe landmark $(m_x, m_y) \in [0,1]^2$, the final canvas pixel position $(p_x, p_y)$ is computed as follows:
Step 0 — Center Calibration
$$c_x' = m_x - \delta_x, \quad c_y' = m_y - \delta_y$$
where $(\delta_x, \delta_y)$ is the calibration offset.
Step 1 — Cover Crop Correction
$$e_x = \frac{c_x' - f_x}{w_\text{vis}}, \quad e_y = \frac{c_y' - f_y}{h_\text{vis}}$$
This maps the visible stream region linearly to [0,1].
Step 2 — Horizontal Mirror (scaleX(-1))
$$\mu_x = 1 - e_x$$
Step 3 — Zoom Around Center
$$p_x = \left(0.5 + (\mu_x - 0.5)\cdot z\right)\cdot W$$
$$p_y = \left(0.5 + (e_y - 0.5)\cdot z\right)\cdot H$$
where:
$z$ = CSS zoom factor
$(W, H) = (e_w, e_h)$, the element size in pixels
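Steps 0 to 3 can be sketched as two small functions. This is a minimal illustration rather than the prototype's actual code: the names `coverCrop` and `landmarkToCanvas` and the shape of their arguments are hypothetical, and the formulas are taken directly from the definitions above.

```javascript
// Crop fractions and visible extent for object-fit: cover.
// (vw, vh): native stream size in pixels; (ew, eh): element size in pixels.
function coverCrop(vw, vh, ew, eh) {
  const sCover = Math.max(ew / vw, eh / vh);
  const fx = (vw - ew / sCover) / (2 * vw);
  const fy = (vh - eh / sCover) / (2 * vh);
  return { fx, fy, wVis: 1 - 2 * fx, hVis: 1 - 2 * fy };
}

// Maps a normalized landmark m = {x, y} to canvas pixels.
// delta: calibration offset, zoom: CSS zoom factor, (W, H): element size.
function landmarkToCanvas(m, crop, delta, zoom, W, H) {
  const cx = m.x - delta.x;               // Step 0: center calibration
  const cy = m.y - delta.y;
  const ex = (cx - crop.fx) / crop.wVis;  // Step 1: cover crop correction
  const ey = (cy - crop.fy) / crop.hVis;
  const mx = 1 - ex;                      // Step 2: horizontal mirror
  return {
    x: (0.5 + (mx - 0.5) * zoom) * W,     // Step 3: zoom around center
    y: (0.5 + (ey - 0.5) * zoom) * H,
  };
}
```

For a 1280×720 stream shown in a 720×720 element, `coverCrop(1280, 720, 720, 720)` crops 21.875% from each horizontal side, and a landmark at the stream center lands at the element center.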
Why Work in Normalized Space?
All corrections are performed in normalized coordinates; pixels appear only in the final multiplication by the element size.
Earlier, incorrect formulations mixed pixel-space and normalized-space corrections, effectively applying the crop compensation twice.
The correct approach applies the crop correction exactly once, as a normalized fraction.
II. Lip Model — 1-DOF Oscillator
Physical Basis
The lips in brass playing can be modeled as a damped mass-spring oscillator:
$$\ddot{y} + \frac{\omega_l}{Q_l}\dot{y} + \omega_l^2(y - y_0) = \frac{F_\text{eff}}{m_l}$$
where:
$\omega_l = 2\pi f_l$
$Q_l$ = quality factor
$y_0$ = resting aperture
$F_\text{eff}$ = net force (air pressure − muscular tension)
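One way to integrate this oscillator numerically is a semi-implicit Euler step, sketched below. The integration scheme, the function name `stepLip`, and the parameter object are assumptions for illustration; the source specifies only the differential equation.

```javascript
// One semi-implicit Euler step of the 1-DOF lip oscillator.
// state:  { y, v }  aperture and its velocity
// params: { fl, Q, y0, m }  lip frequency (Hz), quality factor,
//         resting aperture, effective mass (values are hypothetical)
// fEff:   net force; dt: time step in seconds
function stepLip(state, params, fEff, dt) {
  const w = 2 * Math.PI * params.fl;
  // Acceleration from the equation of motion, solved for y-double-dot.
  const a = fEff / params.m
          - (w / params.Q) * state.v
          - w * w * (state.y - params.y0);
  const v = state.v + a * dt; // update velocity first
  const y = state.y + v * dt; // then position, using the new velocity
  return { y, v };
}
```

With zero force the aperture decays back to the resting value; under a constant force it settles at an offset of F_eff / (m_l ω_l²) from it. The step is stable as long as ω_l · dt stays well below 2, which holds comfortably at audio sample rates.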
Embouchure as Continuous Parameter
Lip geometry yields an aperture ratio, where $d_{i,j}$ is the distance between FaceMesh landmarks $i$ and $j$ (13 and 14 on the inner upper and lower lip, 61 and 291 at the mouth corners):
$$\xi = \frac{d_{13,14}}{d_{61,291}}, \quad \xi \in [0,1]$$
Small $\xi$ → tight lips → high register
Large $\xi$ → open lips → low/pedal register
Lip frequency follows:
$$f_l = f_{l,0}\sqrt{T_l/T_{l,0}}$$
Tension $T_l$ is approximated as inversely proportional to $\xi$.
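The aperture ratio and the resulting lip frequency might be computed as follows. This is a sketch under stated assumptions: landmarks are taken as an array of `{x, y}` objects in normalized coordinates (which ignores the aspect-ratio distortion of normalized space), the ratio is clamped to [0, 1], and the tension model is taken as T_l / T_{l,0} = ξ_0 / ξ with a hypothetical reference aperture `xi0`. None of these details are stated in the source.

```javascript
// Euclidean distance between two landmarks {x, y}.
function landmarkDistance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Aperture ratio xi = d(13,14) / d(61,291), clamped to [0, 1].
function apertureRatio(landmarks) {
  const opening = landmarkDistance(landmarks[13], landmarks[14]);
  const width = landmarkDistance(landmarks[61], landmarks[291]);
  if (width === 0) return 0; // degenerate mesh: treat as closed
  return Math.min(1, Math.max(0, opening / width));
}

// f_l = f_l0 * sqrt(T_l / T_l0), assuming T_l / T_l0 = xi0 / xi.
// xi0 is a hypothetical reference aperture at which f_l = f_l0.
function lipFrequency(xi, fl0, xi0) {
  const minXi = 1e-3; // floor to avoid division by zero for closed lips
  return fl0 * Math.sqrt(xi0 / Math.max(xi, minXi));
}
```

Under this model, quartering the aperture doubles the lip frequency.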
Fractional Harmonic Interpolation
Harmonic series:
$$f_n = n f_\text{tube}$$
The user trains pairs $(\xi_k, n_k)$.
Interpolation:
Sort by $\xi_k$
Find interval
Linear parameter:
$$t = \frac{\xi - \xi_k}{\xi_{k+1} - \xi_k}$$
Smooth S-curve:
$$\tilde{t} = t^2(3 - 2t)$$
Fractional partial:
$$n_\xi = n_k + \tilde{t}(n_{k+1} - n_k)$$
First-order smoothing:
$$\dot{n} = \frac{n_\xi - n}{\tau}, \quad \tau = 80\text{ ms}$$
Resulting synthesis frequency:
$$f_0 = f_\text{tube} \cdot n$$
This enables continuous glissandi.
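The interpolation and smoothing steps above can be sketched like this. The function names are hypothetical, clamping to the outermost trained partial outside the trained range is an assumption, and the smoothing uses the exact discretization of the first-order equation for a target held constant over one frame.

```javascript
// pairs: array of { xi, n } trained by the user (at least one entry).
// Returns the fractional partial number for the aperture ratio xi.
function fractionalPartial(xi, pairs) {
  if (pairs.length === 0) throw new Error("no trained pairs");
  const sorted = [...pairs].sort((a, b) => a.xi - b.xi);
  const last = sorted[sorted.length - 1];
  // Assumption: outside the trained range, hold the nearest trained partial.
  if (xi <= sorted[0].xi) return sorted[0].n;
  if (xi >= last.xi) return last.n;
  // Find the interval [xi_k, xi_{k+1}] containing xi.
  let k = 0;
  while (xi > sorted[k + 1].xi) k++;
  const t = (xi - sorted[k].xi) / (sorted[k + 1].xi - sorted[k].xi);
  const s = t * t * (3 - 2 * t); // smooth S-curve
  return sorted[k].n + s * (sorted[k + 1].n - sorted[k].n);
}

// One smoothing step of dn/dt = (target - n) / tau over dt seconds.
function smoothPartial(n, target, dt, tau = 0.08) {
  const alpha = 1 - Math.exp(-dt / tau);
  return n + alpha * (target - n);
}
```

Calling `smoothPartial` once per video or audio frame and multiplying the result by the tube frequency gives the synthesis frequency.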
III. Additive Synthesis Engine
Audio Graph
$$\text{oscillators} \rightarrow \text{HP} \rightarrow \text{peak EQ} \rightarrow \text{LP} \rightarrow \text{master} \rightarrow \text{analyser} \rightarrow \begin{cases} \text{dry}\\ \text{wet} \end{cases}$$
Register-Dependent Brassiness
Normalized partial:
$$p_n = \text{clamp}(n/8,\ 0.125,\ 1.5)$$
Brassiness factor:
$$B_p = \text{clamp}\left(0.9 e^{-2.5(p_n-0.125)} + 0.2 p_n^{1.8} + 0.15\right)$$
Octave Darkening
$$D_\text{oct} = \text{clamp}(1.5 - 0.8 p_n,\ 0.5,\ 1.5)$$
Lowpass Cutoff
$$f_\text{cut} = 280 + P_m \cdot 3200 D_\text{oct} + B_s \cdot 800 B_p + [\text{boost}] \cdot 1400$$
Spectral Roll-Off
$$\alpha_h = 0.3 + 0.5P_m - 0.25B_p$$
$$A_h \propto h^{-\alpha_h}$$
Odd Harmonic Emphasis
$$A_h^\text{adj} = \begin{cases} A_h (1 + 0.6 B_p) & h \text{ odd} \\ A_h (1 - 0.2 B_p) & h \text{ even} \end{cases}$$
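The register-dependent quantities above combine into a lowpass cutoff and a set of partial amplitudes, as in the following sketch. Several things here are assumptions: the function name `timbreParams`, the number of partials, the unnormalized amplitudes (proportionality constant 1), and above all the clamp bounds for B_p, which the formula above does not specify (the hypothetical default [0, 1.5] never binds over the allowed range of p_n). P_m, B_s, and the boost flag are passed in as plain numbers and a boolean.

```javascript
function clamp(x, lo, hi) {
  return Math.min(hi, Math.max(lo, x));
}

// n: (fractional) partial number; Pm, Bs: control values as used in the
// cutoff formula; boost: boolean flag.
// bpRange: ASSUMED clamp bounds for B_p (not given in the source).
function timbreParams(n, Pm, Bs, boost, numPartials = 8, bpRange = [0, 1.5]) {
  const pn = clamp(n / 8, 0.125, 1.5);
  const BpRaw = 0.9 * Math.exp(-2.5 * (pn - 0.125)) + 0.2 * Math.pow(pn, 1.8) + 0.15;
  const Bp = clamp(BpRaw, bpRange[0], bpRange[1]);
  const Doct = clamp(1.5 - 0.8 * pn, 0.5, 1.5);
  const fCut = 280 + Pm * 3200 * Doct + Bs * 800 * Bp + (boost ? 1400 : 0);
  const alpha = 0.3 + 0.5 * Pm - 0.25 * Bp;
  const amps = [];
  for (let h = 1; h <= numPartials; h++) {
    const base = Math.pow(h, -alpha);                       // spectral roll-off
    const weight = h % 2 === 1 ? 1 + 0.6 * Bp : 1 - 0.2 * Bp; // odd emphasis
    amps.push(base * weight);
  }
  return { pn, Bp, Doct, fCut, amps };
}
```

The returned `amps` array would typically be normalized before being applied to oscillator gains, so that the overall level does not change with register.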
IV. Cubic-Root Companding for Mic Gate
Motivation
RMS values typically lie in:
$$[0.001,\ 0.06]$$
Mapped linearly, this interval occupies only a small fraction of the $[0,1]$ control range.
Companding
$$\rho_\text{comp} = \text{RMS}^{1/3}$$
Intensity normalization:
$$\rho_\text{int} = \text{clamp}(1.8\rho_\text{comp} - 0.04,\ 0,\ 1)$$
Raw weight:
$$w_\text{raw} = \text{clamp}(0.78\rho_\text{int} + 0.22\chi,\ 0,\ 1)$$
Adaptive Noise Floor
$$\eta_{k+1} = \eta_k + 0.025\left(\min(0.25,\ w_\text{raw}) - \eta_k\right)$$
Final Gate
$$d_0 = \text{clamp}(w_\text{raw} - \eta - 0.005,\ 0,\ 1)$$
$$d = d_0^{0.65}$$
User threshold $T$:
$$a = \frac{d - T}{1 - T}$$
Exponent:
$$\varepsilon = 0.45 + 0.55T$$
$$g_\text{open} = a^\varepsilon$$
Soft pre-trigger:
$$p = \frac{d - 0.6T}{0.4T}$$
$$g_\text{below} = 0.18\, p^{1.4}$$
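The gate chain above can be broken into small pure functions, sketched below. Function names are hypothetical. The source does not state how g_open and g_below are combined, nor whether the floor η is updated before or after it is subtracted, so the sketch returns the two gains separately and leaves the ordering to the caller. Clamping a and p to [0, 1] and requiring 0 < T < 1 are also assumptions.

```javascript
function clamp(x, lo, hi) {
  return Math.min(hi, Math.max(lo, x));
}

// Cubic-root companding, intensity normalization, and raw weight.
// chi is the second input of the raw-weight formula, passed through as-is.
function rawWeight(rms, chi) {
  const rhoComp = Math.cbrt(rms);
  const rhoInt = clamp(1.8 * rhoComp - 0.04, 0, 1);
  return clamp(0.78 * rhoInt + 0.22 * chi, 0, 1);
}

// One update of the adaptive noise floor eta.
function updateFloor(eta, wRaw) {
  return eta + 0.025 * (Math.min(0.25, wRaw) - eta);
}

// Gate drive d from the raw weight and the current floor.
function gateDrive(wRaw, eta) {
  return Math.pow(clamp(wRaw - eta - 0.005, 0, 1), 0.65);
}

// Gain curves for a user threshold T, assumed to satisfy 0 < T < 1.
function gateCurves(d, T) {
  const a = clamp((d - T) / (1 - T), 0, 1);
  const gOpen = Math.pow(a, 0.45 + 0.55 * T);
  const p = clamp((d - 0.6 * T) / (0.4 * T), 0, 1);
  const gBelow = 0.18 * Math.pow(p, 1.4);
  return { gOpen, gBelow };
}
```

As a check on the companding, an RMS of 0.001 maps to a compressed value of 0.1 and an intensity of 0.14, while 0.06 maps to roughly 0.39 and 0.66, so the typical RMS interval now spans about half of the control range.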
V. Mouth Center Calibration
Calibration key [C]:
$$\delta_x = \bar{x}_\text{mouth} - 0.5$$
$$\delta_y = \bar{y}_\text{mouth} - 0.5$$
CSS shift:
$$T_x = \delta_x z W$$
$$T_y = -\delta_y z H$$
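A calibration handler might compute the offset and the CSS shift as sketched here. The function names are hypothetical, and taking the mean over a caller-supplied list of mouth landmarks is an assumption; the source does not say which points enter the mean.

```javascript
// mouthPoints: non-empty array of normalized landmarks {x, y} around the mouth.
// Returns the calibration offset (delta_x, delta_y) from the frame center.
function calibrationOffset(mouthPoints) {
  let sumX = 0;
  let sumY = 0;
  for (const p of mouthPoints) {
    sumX += p.x;
    sumY += p.y;
  }
  return {
    x: sumX / mouthPoints.length - 0.5,
    y: sumY / mouthPoints.length - 0.5,
  };
}

// CSS translation in pixels for zoom factor z and element size (W, H).
function cssShift(delta, zoom, W, H) {
  return { tx: delta.x * zoom * W, ty: -delta.y * zoom * H };
}
```

The opposite signs of the two components follow from the horizontal mirror: a mouth to the right of center in the stream appears to the left on screen and must be shifted right, whereas the vertical axis is not flipped.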
VI. Synthetic Impulse Response
$$\text{IR}(t) = (2u-1)e^{-6t/\tau_r} + \sum_k a_k\delta(t - t_k)$$
Early reflections:
| k | t_k (ms) | a_k (L / R) |
|---|----------|-------------|
| 1 | 18 | 0.70 / 0.60 |
| 2 | 32 | 0.43 / 0.50 |
| 3 | 55 | 0.35 / 0.35 |
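The impulse response could be rendered into two sample buffers as follows. This sketch makes several assumptions not stated in the source: u is an independent uniform random value in [0, 1) per sample and per channel, τ_r is also used as the buffer duration in seconds, and each early reflection is a single-sample impulse. The function name and the injectable `rng` parameter are hypothetical.

```javascript
// Early reflections from the table above: delay in ms, left/right gains.
const EARLY_REFLECTIONS = [
  { ms: 18, left: 0.70, right: 0.60 },
  { ms: 32, left: 0.43, right: 0.50 },
  { ms: 55, left: 0.35, right: 0.35 },
];

// Returns { left, right } Float32Arrays holding the stereo impulse response.
// rng: source of uniform values in [0, 1); injectable for reproducibility.
function makeImpulseResponse(sampleRate, tauR, rng = Math.random) {
  const length = Math.round(sampleRate * tauR);
  const left = new Float32Array(length);
  const right = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    const t = i / sampleRate;
    const envelope = Math.exp((-6 * t) / tauR); // exponential decay
    left[i] = (2 * rng() - 1) * envelope;       // decaying bipolar noise
    right[i] = (2 * rng() - 1) * envelope;
  }
  for (const r of EARLY_REFLECTIONS) {
    const i = Math.round((r.ms / 1000) * sampleRate);
    if (i < length) {
      left[i] += r.left;   // discrete impulse a_k at t_k
      right[i] += r.right;
    }
  }
  return { left, right };
}
```

In a browser, the two arrays could be copied into a two-channel `AudioBuffer` with `copyToChannel` and assigned to a `ConvolverNode` on the wet path.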
VII. Mobile Camera & Mic Acquisition
Fallback order:
exact deviceId
→ ideal deviceId
→ facingMode: user
→ video: true
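The fallback order above maps onto a list of `getUserMedia` constraint objects that are tried in turn. The sketch below covers the camera only; the helper names are hypothetical, and the `getUserMedia` argument is injectable so the chain can be exercised outside a browser.

```javascript
// Builds the constraint objects in fallback order.
// The two deviceId steps are skipped when no deviceId is known.
function cameraConstraintChain(deviceId) {
  const chain = [];
  if (deviceId) {
    chain.push({ video: { deviceId: { exact: deviceId } } });
    chain.push({ video: { deviceId: { ideal: deviceId } } });
  }
  chain.push({ video: { facingMode: 'user' } });
  chain.push({ video: true });
  return chain;
}

// Tries each constraint set until one yields a stream.
// Rethrows the last error if every attempt fails.
async function openCamera(
  deviceId,
  getUserMedia = (c) => navigator.mediaDevices.getUserMedia(c)
) {
  let lastError;
  for (const constraints of cameraConstraintChain(deviceId)) {
    try {
      return await getUserMedia(constraints);
    } catch (err) {
      lastError = err; // e.g. OverconstrainedError or NotFoundError
    }
  }
  throw lastError;
}
```

A permission denial (`NotAllowedError`) will fail at every step, so a real implementation may prefer to stop at the first such error instead of walking the whole chain.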
Unified device listener:
// Re-enumerate cameras and microphones whenever a device is added or removed.
navigator.mediaDevices.addEventListener('devicechange', async () => {
  await this.refreshCameraInputs();
  await this.refreshMicInputs();
});
VIII. Valve Mapping
Effective tube length:
$$L_\text{eff} = L_0 \prod_{k=1}^{4} 2^{v_k s_k/12}$$
Tube frequency:
$$f_\text{tube} = \frac{c}{2L_\text{eff}}$$
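The valve formulas translate directly into code. In this sketch v_k is read as a 0/1 valve state and s_k as the number of semitones by which valve k lowers the pitch; the default values [2, 1, 3, 5] follow the usual brass convention for valves 1 to 4 and are an assumption, as are the function names and the default speed of sound of 343 m/s.

```javascript
// L0: open tube length in meters; valves: array of 0/1 (or booleans).
// semitones: ASSUMED pitch drop per valve, conventional brass layout.
function effectiveLength(L0, valves, semitones = [2, 1, 3, 5]) {
  let length = L0;
  for (let k = 0; k < semitones.length; k++) {
    if (valves[k]) length *= Math.pow(2, semitones[k] / 12);
  }
  return length;
}

// Tube frequency f_tube = c / (2 * L_eff), with c in m/s.
function tubeFrequency(effLength, c = 343) {
  return c / (2 * effLength);
}
```

With a hypothetical open length of 5.5 m, `tubeFrequency(effectiveLength(5.5, [0, 0, 0, 0]))` is about 31.2 Hz, and pressing valves 1 and 2 together lowers it by three semitones.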
IX. Web Instrument Paradigm
The innovation is not additive synthesis.
The innovation is the control channel:
Camera → Lip Geometry → Fractional Harmonics → Timbral Field
The web browser becomes:
A lip amplifier
A high-dimensional gestural interface
A programmable brass organology
X. Spectral Metrics
Centroid:
$$C = \frac{\sum f_k |X_k|^2}{\sum |X_k|^2}$$
Harmonicity:
$$H = \frac{\text{harmonic energy}}{\text{total energy}}$$
Brassiness index:
$$B_s = 0.6\frac{C}{C_\text{ref}} + 0.4(1-H)$$
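The three metrics can be computed from one magnitude spectrum as sketched below. The source does not define how harmonic energy is measured, so this sketch assumes it is the energy in the single bin nearest each multiple of the fundamental `f0`. It also assumes linear magnitudes (an `AnalyserNode` delivers decibels from `getFloatFrequencyData`, which would need converting first) and a caller-supplied reference centroid `cRef`. All names are hypothetical.

```javascript
// mags:  linear magnitude per bin |X_k|; bin k is centered at k * binHz.
// f0:    fundamental frequency in Hz; cRef: reference centroid in Hz.
function spectralMetrics(mags, binHz, f0, cRef) {
  let total = 0;
  let weighted = 0;
  for (let k = 0; k < mags.length; k++) {
    const energy = mags[k] * mags[k];
    total += energy;
    weighted += k * binHz * energy;
  }
  if (total === 0) return { centroid: 0, harmonicity: 0, brassiness: 0 };

  // ASSUMPTION: harmonic energy = energy of the bin nearest each h * f0.
  let harmonic = 0;
  const counted = new Set(); // avoid counting a bin twice
  const nyquist = mags.length * binHz;
  for (let h = 1; f0 > 0 && h * f0 < nyquist; h++) {
    const k = Math.round((h * f0) / binHz);
    if (k < mags.length && !counted.has(k)) {
      counted.add(k);
      harmonic += mags[k] * mags[k];
    }
  }

  const centroid = weighted / total;
  const harmonicity = harmonic / total;
  const brassiness = 0.6 * (centroid / cRef) + 0.4 * (1 - harmonicity);
  return { centroid, harmonicity, brassiness };
}
```

A wider harmonic window (for example one bin on either side of each multiple) would be more tolerant of spectral leakage and slight detuning, at the cost of counting some noise as harmonic energy.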
XI. Roadmap
Full FaceMesh integration
Calibrated tube-length tables
Instrument switching (tuba / trumpet)
Karplus–Strong mode
Session export (JSON / extended MIDI)