DTMF (Dual-Tone Multi-Frequency), aka touch tone, was initially designed to be a faster method of dialling: make-and-break dial pulses were slow, and a more efficient method for user input was required as switching was becoming digital.
By using a pair of tones, switching equipment could easily identify the input without complex circuitry, and because each digit needs two specific tones at once, the chances of someone accidentally generating a valid pair were slim. MF had been used for tandem / trunk signalling inside the network with great success, so DTMF was a standout choice.
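To make the two-tone idea concrete, here’s a minimal Python sketch that maps a key to its low/high frequency pair on the standard keypad grid and synthesises the summed tones. The function names, the 100 ms default duration and the 8000 Hz sample rate are my own choices for illustration, not anything mandated by a spec.

```python
import math

# Standard DTMF keypad grid: each key is one row (low) tone plus one column (high) tone.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key):
    """Return the (low, high) frequency pair in Hz for a single keypad symbol."""
    if len(key) != 1:
        raise ValueError(f"expected a single key, got {key!r}")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key.upper())
        if col != -1:
            return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key, duration_s=0.1, sample_rate=8000):
    """Generate float samples in [-1.0, 1.0] for one key press (assumed defaults)."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    # Each tone is scaled to half amplitude so the sum never exceeds 1.0.
    return [
        0.5 * math.sin(2 * math.pi * low * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * n / sample_rate)
        for n in range(count)
    ]
```

Calling `dtmf_frequencies("5")` gives the 770 Hz row tone and the 1336 Hz column tone; a detector only has to decide which one row tone and which one column tone are present.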
SIP was never explicitly designed as a telephony protocol, and as such, its support for DTMF wasn’t baked in from the start.
Over time organisations started using DTMF so users could interact with IVRs and Auto Attendants, enter PIN codes and interact with services using their telephone, ideas that went beyond the call setup function originally imagined for DTMF.
Your standard subscriber loop POTS line doesn’t have any out of band signalling for the DTMF, but the carrier switch passes through the audio end to end, and the DTMF tones are carried in that audio, so it’s not a problem.
So when SIP rolled along as the de facto standard for voice calls over IP, it didn’t have a method for signalling that a DTMF digit had been pressed.
Never fear, neither does a POTS line, so everything will be fine and the tones will just be carried in the media stream like they are on a POTS line.
This was called in-band DTMF. In-band because the DTMF tones are carried in the audio stream, just as they would be if you played those tones back from a tape recorder or whistled them in harmony.
However, along came G.729 and other compressed codecs, and suddenly these two tones were getting lost in compression, so the VoIP world needed a new way to transport DTMF information.
RFC2833 came to fix this problem in 2000, introducing a special RTP payload, commonly called an “RTP Event”, that denotes a DTMF key-press instead of carrying the tone as audio. RFC2833 was later superseded by RFC4733, which still carries the DTMF as an RTP event.
Here’s a post I did on RFC2833 DTMF.
For some reason this method of DTMF signalling is still referred to as RFC2833, despite the fact that most implementations are of RFC4733.
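As a rough illustration of what one of these RTP events looks like on the wire, here’s a sketch that packs and unpacks the 4-byte telephone-event payload (event code, end bit, volume, duration). It only covers the payload, not the surrounding RTP header, and the function names and the default volume of 10 are assumptions of mine.

```python
import struct

# DTMF event codes in the telephone-event payload: 0-9, then *, #, A-D (0..15).
EVENT_CODES = {key: code for code, key in enumerate("0123456789*#ABCD")}


def pack_telephone_event(key, duration, end=False, volume=10):
    """Pack one 4-byte telephone-event payload.

    duration is in RTP timestamp units (at an 8000 Hz clock, 800 units = 100 ms).
    volume is the tone level in dBm0 with the sign dropped, 0..63.
    """
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    # Second byte: E (end) bit, reserved bit left as 0, then 6 bits of volume.
    flags = (0x80 if end else 0x00) | volume
    return struct.pack("!BBH", EVENT_CODES[key.upper()], flags, duration)


def unpack_telephone_event(payload):
    """Decode the first 4 bytes of a telephone-event payload into a dict."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return {
        "event": event,
        "end": bool(flags & 0x80),
        "volume": flags & 0x3F,
        "duration": duration,
    }
```

A sender typically emits several of these for one key press with a growing duration, and marks the final ones with the end bit set, so the receiver learns about the digit without ever decoding audio.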
But the next problem facing SIP implementers was that SIP proxies had no awareness of the DTMF events, because by definition, a SIP proxy only works with the SIP (signalling) part of the call, not the RTP (media).
So for a device to know when a DTMF keypress happened, it’d have to be listening in to the RTP media stream to pick up the RTP events.
The solution that’s considered best practice today is nearly as old as RFC2833 itself. RFC2976, also published in 2000, describes using SIP INFO messages to carry payloads. (Link to post on the topic)
With SIP INFO, the DTMF digit is placed in the body of the INFO message, so it’s now often used to carry DTMF info as well as ISUP messaging.
It seems like a backwards step, but it means proxies can be aware of DTMF messaging, and interoperability is, in theory, enhanced.
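RFC2976 itself doesn’t define a DTMF body format, so what actually goes in the INFO payload is down to convention. One format you’ll commonly see is a `Signal=` / `Duration=` body sent with the content type `application/dtmf-relay`. The sketch below builds and parses that body only; the function names and the 160 ms default are assumptions for illustration, and your equipment may expect something different.

```python
# Commonly seen (but not RFC2976-defined) content type for DTMF in SIP INFO.
DTMF_RELAY_CONTENT_TYPE = "application/dtmf-relay"


def build_dtmf_relay_body(key, duration_ms=160):
    """Build the body of a SIP INFO request carrying one DTMF digit."""
    return f"Signal={key}\r\nDuration={duration_ms}\r\n"


def parse_dtmf_relay_body(body):
    """Return (signal, duration_ms); either is None if the field is missing."""
    fields = {}
    for line in body.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            fields[name.strip().lower()] = value.strip()
    duration = fields.get("duration")
    return fields.get("signal"), int(duration) if duration is not None else None
```

Because the digit sits in plain text in the signalling path, anything that sees the SIP messages can read it with a few lines of parsing like this, which is exactly what the RTP event approach doesn’t allow.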
The disadvantage is there are now 3 possible implementations: DTMF in-band, DTMF in RTP events, and DTMF in SIP INFO.
Some endpoints use more than one method, some even use all 3. The idea is that it’ll “just work” and won’t need configuring. So when a user presses a digit it plays the tone (in-band), sends an RTP event (RFC4733/2833) and sends a SIP INFO message containing the pressed digit (RFC2976) all at once.
This can cause huge headaches: if the switch it’s talking to can recognise more than one type of DTMF signalling, it gets multiple inputs for a single key press, causing it to jump through IVRs and menus.
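The clean fix is to configure both ends to use a single method, but where that isn’t possible, a receiver can try to collapse the duplicates. Here’s a hypothetical sketch of that heuristic: if the same digit turns up from a different source within a short window, treat it as a copy of the key press already seen. The class name, source labels and the 0.5 second window are all made up for illustration, not taken from any real switch.

```python
class DtmfDeduplicator:
    """Drop copies of a digit that arrive via a second signalling method."""

    def __init__(self, window_s=0.5):
        self.window_s = window_s
        self._last = None  # (digit, source, timestamp) of the last accepted digit

    def accept(self, digit, source, timestamp):
        """Return True if the digit should be passed on to the IVR."""
        last = self._last
        if (
            last is not None
            and digit == last[0]
            and source != last[1]  # same source twice is a genuine repeat press
            and timestamp - last[2] <= self.window_s
        ):
            return False
        self._last = (digit, source, timestamp)
        return True


dedup = DtmfDeduplicator()
events = [("5", "inband", 0.00), ("5", "rtp-event", 0.02), ("5", "sip-info", 0.03)]
passed = [digit for digit, source, ts in events if dedup.accept(digit, source, ts)]
```

Here only the first “5” makes it through. It’s a heuristic, so it can still get things wrong, for example if the copies arrive further apart than the window allows.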
If only we had one universal standard…