Recording & Rendering 101
Recording what we hear - A progression
If a person or brain B1 observes and perceives a tree falling in the forest (B), and this sensation is to be transmitted to a second person B2 in a different location and at a different time, then several pieces of information have to be recorded. They would include the ear drum signals of B1, the body motion V in response to the sound wave, and any structure-borne vibration. The ear drum signals would have to be applied to the ear drums of B2, and the motional signals from B1 would have to impart the same motion to B2. There are inherent errors in this transmission. For example, the outer ear of B1 does not match the outer ear of B2, but the brain of B2 has only learned to process sounds heard through its own ears. Also, the details of neural wiring in brain B2 are likely to be sufficiently different from those of B1 that an exact duplication of perception is ultimately not possible.
Regardless of the seeming futility, we try the next best approach, which is binaural recording and playback. Here the person B1 is replaced by a dummy head with torso, D. Microphones replace the ear drums. The outer ears and the shape of the head are averages of human forms, as are the surface textures. The recorded dummy head signals should be applied to the ear drums of a listener B3. This is not without problems.
Binaural sound reproduction can be tonally and spatially very realistic except for localization in the frontal hemisphere. There it suffers from in-head localization. The soundstage is usually not perceived as being outside and in front of the head. I have been told that out-of-head localization can be learned, but I have not spent enough time to find out whether that also works for me. The in-head soundstage follows any head movement rather than remaining stationary. This presents a completely unnatural cue to the brain. It can be avoided by tracking the movement of the head and adjusting each ear signal according to the head's position relative to the soundstage. Video game consoles sometimes use this technique, and in combination with a visual image it can give a realistic spatial rendering.
So the simplified transmission system (C) above can reproduce a certain number of cues sufficiently close to create a fairly good illusion of a real event, but it lacks frontal localization and body-related tactile inputs compared to (B). If the dummy head recording is played back over loudspeakers in a reflection-free room (G), then two new problems become apparent. The frequency spectrum of the recording has been modified by the external ear of the dummy head D and is modified again by the external ear of the listener B3. This causes a sound coloration that can be avoided by applying the inverse of the head-related transfer function for a given loudspeaker and listener setup to the microphone signals of D. In addition, the signal from the left loudspeaker impinges upon both the left ear and the right ear. Similarly, the right speaker sends signals to both right and left ears. This cross-talk between the ears can be cancelled for a precisely fixed loudspeaker and listener setup. Compensating signals are added electronically to the left and right loudspeaker signals so that LL cancels at the left ear the contribution from RL. Correspondingly RR compensates for LR. Alternatively, a wall can be placed between the speakers that extends forward to the head of B3. It physically blocks the crosstalk signals LR and RL. Both solutions confine the listener's head to a small region for the cancellation to be effective, and they do work well under anechoic playback conditions. The required setup conditions are hardly met in typical living rooms (H), where a multiplicity of loudspeaker sound reflections easily destroys the acoustic balancing act. While the brain tries to compensate, listening eventually becomes tiring.
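The electronic cancellation described above can be sketched numerically. The following fragment is an illustration only, not an actual implementation: it models the same-side (ipsilateral) path as unity and the opposite-side (contralateral) path as an attenuated, delayed response, both assumed values, then inverts the resulting 2x2 system per frequency bin to derive the compensating speaker feeds. A real system would use measured head-related transfer functions instead.

```python
import numpy as np

fs = 48000
n = 1024
freqs = np.fft.rfftfreq(n, 1 / fs)

# Idealized transfer functions (assumptions, not measured HRTFs):
# ipsilateral path = unity; contralateral path = attenuated and delayed.
delay_s = 0.25e-3   # extra acoustic path around the head (assumed)
atten = 0.7         # head shadowing, taken frequency-independent here
H_ipsi = np.ones_like(freqs, dtype=complex)
H_contra = atten * np.exp(-2j * np.pi * freqs * delay_s)

# Per frequency bin: ear signals = H @ speaker signals, with the same
# 2x2 symmetric H at both ears. Inverting H gives the compensation network.
det = H_ipsi * H_ipsi - H_contra * H_contra
C_same = H_ipsi / det       # filter for the same-side program signal
C_comp = -H_contra / det    # compensating signal fed to the opposite speaker

# Verify: with compensation, each ear receives only its intended signal.
ear_same = H_ipsi * C_same + H_contra * C_comp    # should equal 1
ear_cross = H_ipsi * C_comp + H_contra * C_same   # should equal 0
print(np.max(np.abs(ear_same - 1)), np.max(np.abs(ear_cross)))
```

The check at the end confirms algebraically what the text states: the added compensating signal cancels the contralateral leakage exactly, but only because the transfer functions are assumed known and fixed, which is why the listener's head must stay in one place.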
That emphasis has continued to prejudice and influence recording techniques to this day. Therefore it is common practice to use a multitude of microphones to highlight individual performers or instrument groups. Recordings are done in studios for full control over the reverberant sound. Lost in the process is any sense of a coherent acoustic space, of the venue acoustics, in the recording. Instead, the space in which the sound occurs is often chopped up into isolated lumps.
Typical loudspeakers in typical room setups are not capable of reproducing a full spatial impression even if the cues for it are imbedded in the recording. This is due to the room reflections, which in turn are a function of the polar response of the loudspeakers and their placement in the room. Thus what has been done to the spatial aspect in the recording process goes largely unnoticed during playback. Even the recording/mixing/mastering engineer was probably not aware of the consequences of his decisions, because the typical monitor loudspeakers are not up to the task of telling him. Loudspeakers must be either full-range omni-directional or dipolar, placed away from reflecting surfaces, and placed symmetrically with respect to the room boundaries. In that configuration the brain can disassociate the room reflected sound from the loudspeaker direct sound and fully use the spatial cues in the direct sound to form spatial impressions and localization of phantom sources.
The microphone setup (K) consists of two main microphones LM and RM and two ambient microphones LA and RA. The main microphones are super-cardioids (Schoeps MK 41) for well controlled off-axis frequency response. They are separated by D1 to duplicate the acoustic path length between the ears. They are angled to cover the width of the sound source while being placed further from it than is usual. The overall aim is to record an audience perspective of the source. The ambient microphones are omni-directional (Schoeps MK 2S). They are placed sufficiently behind the main microphones, D2, and apart from each other, D3, to be decorrelated from each other. Typical dimensions could be D1 = 8 inch, a = 110 degrees, D2 = 30 feet, D3 = 40 feet.
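The suggested dimensions imply specific time relationships between the channels, which can be checked with a few lines of arithmetic. The speed of sound is assumed to be 343 m/s here; the dimensions are the typical values given above, not fixed requirements.

```python
# Time relationships implied by the suggested dimensions
# (D1 = 8 in, D2 = 30 ft, D3 = 40 ft); speed of sound assumed 343 m/s.
C = 343.0                  # speed of sound in m/s at roughly 20 degrees C
IN, FT = 0.0254, 0.3048    # unit conversions to meters

D1 = 8 * IN                # main microphone spacing LM-RM
D2 = 30 * FT               # main pair to ambient pair distance
D3 = 40 * FT               # ambient microphone spacing LA-RA

itd_max = D1 / C           # largest inter-channel delay for a lateral source
ambient_delay = D2 / C     # extra travel time of direct sound to LA/RA

print(f"max main-pair delay: {itd_max * 1e3:.2f} ms")
print(f"ambient mic delay:   {ambient_delay * 1e3:.1f} ms")
```

The main-pair spacing yields a maximum inter-channel delay of about 0.6 ms, on the order of the inter-aural time differences used for localization, while the 30 ft setback delays the direct sound at the ambient microphones by roughly 27 ms, placing their pickup well into the reflected sound field.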
Recording and Environment
All acoustical events take place in an environment. Those environments can be very different, like the relatively open space of the falling tree, the large and closed space of a concert hall, a restaurant or your bathroom. We perceive the specifics of an environment from the multitude of reflections that occur when the sound wave strikes boundaries in its path of propagation. Reflected sound always arrives delayed relative to sound that propagates directly from source to receiver. Thus any acoustic event that is perceived as sound has at least two elements to it: the direct signal and the reflections from the environment. Our brain is thoroughly used to processing that mix of information, and we may become aware of the two elements by paying attention.
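The extra delay of a reflection follows from simple image-source geometry: mirror the source across the reflecting boundary and compare path lengths. The distances below are made up for illustration, not taken from the text.

```python
import math

# Extra delay of a single wall reflection versus the direct path,
# via image-source geometry. All positions are illustrative.
c = 343.0                  # speed of sound, m/s (assumed)
src = (0.0, 2.0)           # source position (x, y) in meters
rcv = (5.0, 2.0)           # receiver position
img = (src[0], -src[1])    # source mirrored across the wall y = 0

direct = math.dist(src, rcv)
reflected = math.dist(img, rcv)
extra_ms = (reflected - direct) / c * 1e3
print(f"direct {direct:.2f} m, reflected {reflected:.2f} m, "
      f"extra delay {extra_ms:.2f} ms")
```

For this geometry the floor bounce travels about 1.4 m farther than the direct sound and arrives roughly 4 ms later; the same construction applies to every boundary, which is why a closed room produces the dense pattern of delayed arrivals we hear as its acoustic signature.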
The 4-microphone setup captures the direct and reflected signal spectra with temporal separation by placing the microphones in different parts of the acoustic space. Thus, when played back over two loudspeakers, it presents stronger cues to the brain about the acoustic source and its environment than a two-microphone setup could provide, since the latter captures both elements simultaneously and in difficult-to-control proportions. The ratio of direct to ambient reflected signals is very different for a musician in the orchestra, for the conductor in front of the orchestra, or for a member of the audience behind the conductor in the concert hall. Underlying any recording is a decision as to which perspective to present to the loudspeaker listener. Is it the musician's perspective, the conductor's, the audience's, or none of the above and instead some artificial perspective that serves a particular purpose? The 4-microphone setup appears to be well suited to capture the audience perspective and might be called the "D+R Stereo" technique, for capturing direct and reflected sounds separately.
Sound stream segregation
In natural hearing the evolutionary and adaptive processor between the ears is capable of segregating a direct sound stream from a reflected sound stream and from a structure borne vibration. We are able to perceive the direction from which sound is coming, the distance of the source and even its size primarily from the direct sound. The reflected sound can enhance or detract from this information and we can perceptually remove it, if it is not relevant to the source information. This works in a wide range of acoustic environments.
For example, think of having a conversation with another person at a cocktail party. At the same time there are many conversations going on around you. This can make it difficult to understand your partner. The sound stream coming from her to your left and right ears tends to fuse with the many sound streams that are happening around you and are also impinging on your ears. You may step closer to her to increase the volume of this sound stream relative to the other streams, which then become the noisy background for your conversation. On the other hand, you may also be interested in the conversation between X and Y over in the corner of the room. As you focus your mind on that conversation you pick up segments of that specific sound stream and make sense of them, all while you are carrying on your own conversation here. Your brain is multi-tasking.
I can say from personal experience that this is a difficult process if you are not intimately familiar with the language that is spoken around you. My native language is German and it took me years of exposure to English to be able to process sound in the acquired language as easily as in my native language. Certainly, language cognition is one element of the "cocktail party effect". Other elements are the timing differences between left and right ear signals due to the direction from which a sound stream arrives, the envelope modulation depth of the X-Y sound stream relative to the background sound streams or noise, and the timing of the X-Y stream segments relative to your conversation. It is difficult to hear while you are talking.
The venue in which the cocktail party takes place also has an effect on the ease of conversation. If it is a large hall with highly reflective surfaces and long reverberation time, then distant conversations lose envelope modulation depth and are difficult to understand even though the volume level in the hall may not be that high. The long reverberation fuses distant streams. If the room is small and there are many people, then the modulation depth of a more distant sound stream becomes low even when its volume is high and thus fewer segments are heard and recognized.
When a recording is made, the microphones capture the direct sound stream from the sound sources and the reflected sound stream from the recording venue.
When the recording is played back in a room, the listener is again exposed to two sound streams: the direct sound from the two loudspeakers and the room reflected sound. Provided that the loudspeakers have uniform directivity, we can apparently segregate the two streams perceptually to a large degree. This was recognized in the ORION and PLUTO comparison.
Imbedded in the direct sound from the two loudspeakers is the direct sound stream that the microphones received and the recording venue's reflected stream. Thus during playback the processor between the ears is asked to deal with four streams of acoustic information: the direct loudspeaker sound and its reflection in the listening room, and the direct microphone signal and its reflection in the recording venue.
The proposed four microphone technique captures direct and reflected sound streams in the recording venue with time and intensity separation between them. Mixing the two streams in optimal proportion before playback should give the listener's brain stronger cues for constructing an illusion about the original sound sources in their acoustic space, similar to what a person in the audience in that space would hear live. But, just as language familiarity is helpful in the cocktail party effect, so is familiarity with live acoustic events to recognize and appreciate spatial characteristics in a recording. Most recording techniques aim for clarity first. Spatiality is secondary and often synthesized which is readily recognized over ORION or PLUTO.
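The mix-time choice of proportion can be pictured with a toy example. Everything below is a stand-in: a sine tone for a main (direct) track, filtered noise for an ambient track, and an arbitrary gain value; the point is only that recording the two streams separately makes the direct-to-reverberant ratio an explicit mixing decision rather than a fixed property of the microphone placement.

```python
import numpy as np

# Hypothetical D+R mixdown: main and ambient channels were recorded
# separately, so their proportion is chosen freely at mix time.
fs = 48000
t = np.arange(fs) / fs
LM = np.sin(2 * np.pi * 440 * t)                        # stand-in main track
LA = np.random.default_rng(0).standard_normal(fs) * 0.1  # stand-in ambience

ambient_gain = 0.5          # the "optimal proportion" knob, arbitrary here
left = LM + ambient_gain * LA  # one channel of the two-channel mix

dr_ratio_db = 10 * np.log10(np.mean(LM**2)
                            / np.mean((ambient_gain * LA)**2))
print(f"direct/ambient power ratio in the mix: {dr_ratio_db:.1f} dB")
```

Raising or lowering `ambient_gain` moves the listener's apparent perspective closer to or farther from the source, which with a conventional two-microphone pickup could only be done by physically moving the microphones.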
With four microphones both clarity and spatiality should be captured. This is not an attempt at surround sound. The sound stage in the listening room will always be behind the loudspeakers, and in that sense it is a spatial distortion relative to the recording situation. It is an easily accepted distortion because of familiar elements in it.
Related reading material: