Well Android cleverly has formatted this message so that it has special meaning on Android phones, but non-android phones, will just see:
👍 to ” Test 123″
Breaking it down:
200A is Hair Space followed by 200B which is a Zero Width Space (This is metadata for Google’s messaging app) to denote it’s an emoji. This is really clever, as these characters mean it’s an emoji reaction to a previous SMS, but they don’t render as visible characters if the user is on a phone that doesn’t know how to treat it.
Our UTF-16 thumbs up is represented at D83D DC4D (Surrogate), and then the original message is quoted after it.
As a fun aside, this does not work if the original message was near the limit of a single part message, as the original message is quoted, so what happens then?
Here’s a MO message that’s exactly 160 characters:
If I react, the original message (160 chars, GSM7 data) is included, but as we’re including an emoji, it must be UCS2 (essentially UTF16) as the emoji isn’t part of the GSM7 alphabet.
In this case, this ends up being a 3 part message, the original text has almost doubled in size as it’s now in UTF16 – it was 140 bytes before (140 bytes fits 160 GSM7 characters) but is now 320 bytes (140 characters at 8 bytes each) + the 8 bytes for the hair space + whitespace + 8 more bytes for the emoji + the 6 bytes UDH header per message, brings us to a whopping 347 bytes not including overhead.
We were testing international roaming for a customer, roaming into the US where one of our team members (shoutout to Cody) is based.
So we sent Cody some SIMs and asked them to run the basic tests for us, but no IMS APN would attach.
We’d get them to power up the phone, fire up a trace on the roaming PGWs, but never seeing an attempt to attach on the IMS APN (No Create Session Request from the SGW in the vPMN).
The operator we were roaming into swore their side was correct – that IMSI was allowed for the IMS APN for testing, but whenever we’d run a trace, same thing, default bearer just fine but no CSR for the IMS APN, as if the IMS APN was blocked on the roaming network (Which is common on networks where you haven’t launched VoLTE but do have data roaming).
The steps we would do is:
Person in the US turns on phone tries to attach
I fire up my computer, get a trace running
Person in the US airplanes the device
I monitor for CSRs on the PGWs for that IMSI
But no CSR for the IMS APN would ever come through.
After a few attempts, here’s what we found was happening:
The phone would get powered up
The phone would roam onto the vPMN in the US and the default bearer would come up
The phone would try to attach to the ims APN, this would work for this SIM which was whitelisted, and the Create Session Request for the ims APN went to the PGW in the hPMN
The ims APN came up as expected
The phone would send a SIP REGISTER (So far so good)
Our IMS had an issue with Rx routing in this scenario, so the SIP REGISTER would timeout, and when it timed out, a 504 error was sent back to the phone by the P-CSCF and it set the Retry-After header to 3600 seconds.
The phone would not try again for as long as that timer value was set.
At this point we’d start a trace, airplane the device, and see no IMS APN attach attempt.
This 504 Timeout would all happen when the phone fired up before we had any traces running, so we weren’t capturing that.
I’d wrongly assumed that airplaning the device after starting a trace would reset the state fully, but it doesn’t, neither does a restart of the phone.
When we’d started a trace and airplane the phone, the phone wouldn’t try to attach to the ims APN as it was still inside the Retry-After time window from when we’d first fired it up.
Per RFC 3261, the phone should not try again during this time, which in our case meant no attempt to attach to the IMS APN, this makes sense – it protects the network against the thundering herd problem, but made this otherwise simple fault really hard to find.
The device showed up as an rndis_host adapter in Linux, just like a USB NIC, and browsing to the default gateway shows a web UI where amazingly, you can set the IMEI – Pretty sure this isn’t legal…
This is not the IMEI it came with – that’s the IMEI of a EP06-e chip on my desk…
Oddly I could see it was an Android device, but with no adb port exposed, but wait – does that mean this an Android phone in a USB stick?
Alas no DIAG mode serial adapter showed up, despite a few variations on the above.
So with ADB connected could I stream the video from the device and use it like an Android phone?
You betcha:
After poking around in Android I found an App called “Qualcomm Settings” which piqued my interest.
Alas the unit only supports LTE Band 1 / 3 / 5 none of which I have in my office (I’m too lazy to go out to the lab to fire up an Airscale) so I put a public SIM in it and was able to use data, but when I tried to make a call it seems it kicked off CS fallback.
More exploring to do, but pretty amazing what $10 buys you!
So we’ve found this scenario that occurs on some Samsung UEs, in certain radio contions, where midway through an otherwise normal voice call, the UE sends “mystery” data (Not IP data), which in turn causes the UPF to send the error indication and drop the bearer, which in turn drops the call.
The call starts, like any normal call, SIP REGISTER, INVITE, etc.
The P-CSCF / PCRF / PGW set up the dedicated bearer for the voice traffic, and the RTP stream starts flowing over it.
Then the UE sends these weird packets instead of the RTP stream:
These are GTP-U encapsulated data, with the TEID that matches the TEID used for the RTP stream, but there’s no IP data in them – they’re only 14 bytes long and sent by the UE.
Here’s some examples of what’s sent (each line is a packet):
An IPv4 header is 20 bytes long, and IPv6 header is 40, so this is too short for either of those protocols, but what else could it be?
There’s some commonality of course, starts d0 as the first octet, then d1, d2, d3, etc. So that’s something?
I thought perhaps it was a boundary issue, that the standard RTP packet was being split across multiple GTP-U payloads, but that doesn’t appear to be the case.
An Ethernet header is 14 bytes, but if we were to decode this as Ethernet there’s still nothing it’s transporting, and the destination MAC is changing sequentailly if that’s the case, which would be even weirder.
I also thought about RTP that for some reason has lost it’s IP/UDP header, as the sequentially counting byte at the start could be the RTP sequence number, but that’d be 19 bytes minimum and the sequence number is the 3rd and 4th byte, not the first.
Whatever they contain, we see this sent over and over for a few seconds, then bam, back to normal RTP stream flowing.
Or at least it should be, but the invalid packet causes the UPF to generate a GTP-U Error Indication.
These Error Indication payloads eventually lead to the next PFCP Session Report Request having the Error Indication Report (ERIR) flag set to True.
When the PGW-U gets this, it sends a Session Delete Request, which dutifully drops the bearer.
Meaning the session drops on the EPC side, and the RTP drops with it, eventually a BYE is sent from the phone due to RTP timeout.
The above screenshot shows a different cause of GTP-U Error Indication – At this point the bearer has been dropped on the EPC side and these are Error Indications to report it doesn’t know the TIED / bearer.
How to fix this?
Well, unlikely we’ll get a fix on the Samsung side, so we’ll need to not drop the bearer on the PGW-C if we get a lot of Error Indications, and hope for the best.
SIP has got a multitude of ways of showing Caller ID, PAI, R-PAI, From, even Contact, but the other day I got a tip (Thanks John!) that you can set a name as the Caller ID in the “Username field “display name” part of the P-Asserted-Identity for the leg from the TAS to the UE, and it’ll show up on the phone, and they’re right.
One thing that it doesn’t do is show the name in the call history, and if you go to “Add as Contact” it still makes you enter the name, clearly that’s not linked in, but it’s a kinda neat feature.
Nope – it doesn’t do anything useful. So why is it there?
The SUBSCRIBE method in SIP allows a SIP UAC to subscribe to events, and then get NOTIFY messages when that event happens.
In a plain SIP scenario (RFC 3261), we can imagine an IP Phone and a PBX scenario. I might have “Busy Lamp Field” aka BLF buttons on the screen of my phone, that change colour when the people I call often are themselves on calls or on DND, so I know not to transfer calls to them – This is often called the “presence” scenario as it allows us to monitor the presence of another user.
At a SIP level, this is done by sending a SUBSCRIBE to the PBX with the information about what I’m interested in being told about (State changes for specific users) and then the PBX will send NOTIFY messages when the state changes.
But in IMS you’ll see SUBSCRIBE messages every time the subscriber registers, so what are they subscribing for?
Well, you’re just subscribing to your own registration status, but your phone knows your own registration status, because it’s, well, the registration status of the phone.
So what does it achieve? Nothing.
The idea was in a fixed-mobile-convergence scenario (keeping in mind that’s one of the key goals from the 2008 IMS spec) you could have the BLF / presence functionality for fixed subscribers, but this rareley happens.
For the past few years we’ve just been sending a 200 OK to SUBSCRIBE messages to the IMS, with a super long expiry, just to avoid wasting clock cycles.
What is included in the Charging Rule on Gx ultimately turns into a Create Bearer Request on GTPv2-C.
But the mapping it’s always obvious, today I got stuck on the difference between a Single remote port type, and a single local port type, thinking that the Packet Filter Direction in the TFT controlled this – It doesn’t – It’s controlled by the order of your Traffic Flow Template rule.
Input TFT:
"permit out 17 from any 50000 to any"
Leads to Packet filter component type identifier: Single remote port type
Whereas a TFT of:
permit out 17 from any to any 50000
Leads to Packet filter component type identifier: Single local port type (64)
Stumbled across these the other day, while messing around with some values on our SMSc.
Setting the Data Coding Scheme to 16 with GSM7 encoding flags the SMS as “Flash message”, which means it pops up on the screen of the phone on top of whatever the user is doing.
While reading a quality telecom blog bam! There’s the flash SMS popping over whatever I was reading.
Oddly while there’s plenty of info online about Flash SMS, it does not appear in the 3GPP specifications for SMS.
Turns out they still work, move over RCS and A2P, it’s all about Flash messages!
There’s no real secret to this other than to set the Data Coding Scheme to 16, which is GSM7 with Flash class set. That’s it.
Obviously to take advantage of this you’d need to be a network operator, or have access to the network you wish to deliver to. Recently more A2P providers are filtering non vanilla SMS traffic to filter out stuff like SMS OTA message or SIM specific messages, so there’s a good chance this may not work through A2P providers.
I’ve been writing a fair bit recently about the “VoLTE Mess” – It’s something that’s been around for a long time, mostly impacting greenfield players rolling out LTE only, but now the big carriers are starting to feel it as they shut off their 2G and 3G networks, so I figured a brief history was in order to understand how we got here.
Note: I use the terms 4G or LTE interchangeably
The Introduction of LTE
LTE (4G) is more “spectrally efficient” than the technologies that came before it. In simple terms, 1 “chunk” of spectrum will get you more speed (capacity) on LTE than the same size chunk of spectrum would on 2G or 3G.
So imagine it’s 2008 and you’re the CTO of a mobile network operator. Your network is congested thanks to carrying more data traffic than it was ever designed for (the first iPhone had launched the year before) and the network is struggling under the weight of all this new data traffic. You have two options here, to build more cell sites for more density (very expensive) or buy more spectrum (extremely expensive) – Both options see you going cap in hand to the finance team and asking for eye-wateringly large amounts of capital for either option.
But then the answer to your prayers arrives in the form of 3GPP’s Release 8 specification with the introduction of LTE. Now by taking some 2G or 3G spectrum, and by using it on 4G, you can get ~5x more capacity from the same spectrum. So just by changing spectrum you own from 2G or 3G to 4G, you’ve got 5x more capacity. Hallelujah!
So you go to Nortel and buy a packet core, and Alcatel and Siemens provide 4G RAN (eNodeBs) which you selectively deploy on the cell sites that are the most congested. The finance team and the board are happy and your marketing team runs amok with claims of 4G data speeds. You’ve dodged the crisis, phew.
This is the path that all established mobile operators took; throw LTE at the congested cell sites, to cheaply and easily free up capacity, and as the natural hardware replacement cycle kicked in, or cell sites reached capacity, swap out the hardware to kit that supports LTE in addition to the 2G and 3G tech.
Circuit Switched Fallback
But it’s hard to talk about the machinations of late 2000s telecom executives, without at least mentioning Hitler.
This video below from 15 years ago is pretty obscure and fairly technical, but the crux of it it is that Hitler is livid because LTE does not have a “CS Domain” aka circuit switched voice (the way 2G and 3G had handled voice calls).
It was optional to include support for voice calls in the LTE network (Voice over LTE) when you launched LTE services. So if you already had a 2G or 3G network (CS Network) you could just keep using 2G and 3G for your voice calls, while getting that sweet capacity relief.
So our hypothetical CTO, strapped for cash and data capacity, just didn’t bother to support VoLTE when they launched LTE – Doing so would have taken more time to launch, during which time the capacity problem would become worse, so “don’t worry about VoLTE for now” was the mantra.
All the operators who still had 2G and 3G networks, opted to just “Fallback” to using the 2G / 3G network for calling. This is called “Circuit Switched Fallback” aka CSFB.
Operators loved this as they got the capacity relief provided by shifting to 4G/LTE (more capacity in the network is always good) and could all rant about how their network was the fastest and had 4G first, this however was what could be described as a “Foot gun” – Something you can shoot yourself in the foot with in the future.
Operators eventually introduce VoLTE
Time ticked on an operators built out their 4G networks, and many in the past 10 years or so have launched VoLTE in their own networks.
For phones that support it, in areas with blanket 4G coverage, they can use VoLTE for all their calls.
But that’s the sticking point right there – If the phones support it.
But if the phones don’t support it, they’re roaming or making emergency calls, there is always been the safety blanket of 2G or 3G and Circuit Switched fallback to well, fall back to.
There’s no driver for operators who plan to (or are required to) operate a 2G or 3G network for the foreseeable future, to ensure a high level of VoLTE support in their devices.
For an operator today with 2G or 3G, Voice over LTE is still optional. Many operators still rely exclusively on Circuit Switched Fallback, and there are only a handful of countries that have turned off 2G and 3G and rely solely on VoLTE.
VoLTE Handset Support
For the past 16 years phone manufacturers have been making LTE capable phones.
But that does not mean they’ve been making phones that support Voice over LTE.
But it’s never been an issue up until this point, as there’s always been a circuit switched (2G/3G) network to fall back to, so the fact that these chips may not support VoLTE was not a big problem.
Many of the cheaper chipsets that power phones simply don’t support VoLTE – These chips do support LTE for data connections but rely on Circuit Switched Fallback for voice calls. This is in part due to the increased complexity, but also because some of the technologies for VoLTE (like AMR) required intellectual property deals to licence to use, so would add to the component cost to manufacture, and in the chips game, keeping down component cost is critical.
Even for chips that do support Voice over LTE, it’s “special”. Unlike calling in 2G or 3G that worked the same for every operator, phone manufacturers require a “Carrier Bundle” for each operator, containing that specific operators’ special flavor of VoLTE, that operator uses in their network.
This is because while VoLTE is standardized (Despite some claims to the contrary) a lot of “optional” bits have existed, and different operators built networks with subtle differences in the “flavor” of their Voice over LTE (IMS) stack they used. The OEMs (Phone / Chip manufacturers) had to handle these changes in the devices they made, for in order to sell their phones through that operator.
This means I can have a phone from vendor X that works with VoLTE on Network Y, but does not support VoLTE on Network Z.
Worse still, knowing which phones are supported is a bit of a guessing game.
Most operators sell phones directly to their customer base, so buying an Network Y branded phone from Vendor X, you know it’s going to support Network Y’s VoLTE settings, but if you change carriers, who knows if it’ll still support it?
When you’ve still got a Circuit Switched network it’s not the end of the world, you’ll just use CSFB and probably not realize it, until operators go to shut down 2G / 3G networks…
IMS Profile selection on an engineering mode MTK based Android handset
Navigating the Maze of VoLTE Compatibility
Here are some simple checklist you can ask your elderly family members if they ask if their phone is VoLTE compatible:
Does the underlying chipset the phone is based on support VoLTE? (you can find this out by disassembling the phone and checking the datasheets for the components from the OEMs after signing NDAs for each)
Does the underlying chipset require a “carrier bundle” of settings to have been loaded for this operator in order to support VoLTE (See Qualcomm MBM as an example)?
What version of this list am I currently on (generally set in the factory) and does it support this operator? (You can check by decapping the ICs and dumping their NVRAM and then running it through a decompiler)
Does my phones OS (Android / iOS) require a “carrier bundle” of it’s own to enable VoLTE? Is my operator in the version of the database on the phone? (See Android’s Carrier Database for example) (You can find the answer by rooting the phone and running some privileged commands to poke around the internal file system)
Does my operator / MNO support VoLTE – Does my plan / package support VoLTE? (You can easily find the answer by visiting the store and asking questions that don’t appear on the script)
If you managed to answer yes to all of the above, congratulations! You have conditional VoLTE support on your phone, although you probably don’t have a working phone anymore.
Wait, conditional VoLTE support?
That’s right folks, VoLTE will work in some scenarios with your operator!
If you plan on traveling, well your phone may support VoLTE at home, but does the phone have VoLTE roaming enabled? Many phones support VoLTE in the home network, but resort to CSFB when roaming.
If it does support VoLTE roaming, does the network you’re visiting support VoLTE roaming? Has the roaming agreement (IRA) between the operator you’re using while traveling and your home operator been updated to include VoLTE Roaming? These IRAs (AA.12 / AA.13 docs) also indicate if the network must turn off IPsec encryption for the VoLTE traffic when roaming, which is controlled by the phone anyway.
Phew, all this talk of VoLTE roaming while traveling scares me, I think I’ll stay home in the safety of the Australian bush with all these great friendly animals around a phone that supports VoLTE on my home network.
Ah – After spending some time in the Australian bush one of our many deadly animals bit me. Time to call for help! Wait, what about emergency calls over VoLTE? Again, many phones support VoLTE for normal calls, fall back to 2G or 3G for the emergency call, so if you have one of those phones (You’ll only find out if you try to make an emergency call and it fails) and try to make an emergency call in a country without 2G or 3G, you’d better find a payphone.
Sarcasm aside, there’s no dataset or compatibility matrix here – No simple way to see if your phone will work for VoLTE on a given operator, even if the underlying chip does support VoLTE.
Operators in Australia which recently shut down their 3G network, were mandated to block devices that didn’t support VoLTE for emergency calling. They did this using an Equipment Identity Register, and blocking devices based on the Type Allocation Code, but this scattergun approach just blocked non-carrier issued devices, regardless of it they supported VoLTE or VoLTE emergency calling.
Blame Game
So who’s to blame here?
There’s no one group to blame here, the industry has created a shitty cycle here:
Standards orgs for having too many “flavors” available
Operators deploying their own “Flavors” of VoLTE then mandating OEMs / Chip manufacturers comply with their “flavor”.
OEMs / Chip manufactures respond by adding “Carrier Bundles” to account for this per-operator customization
I’ve got some ideas on a way to unscramble this egg, and it’s going to take a push from the industry.
If you’re in the industry and keen to push for a fix, get in touch!
It’s time to get a long term solution to this problem, and we as an industry need to lead the change.
Oh boy this has been a pain in the backside with IMS / VoLTE devices using TCP and how they handle the underlying TCP sockets.
A mobile phone from manufacturer A, wants every SIP dialog to be in it’s own TCP session, while a phone from manufacturer B wants a unique TCP session per transaction, while manufacturer C thinks that every SIP message should reuse the same transaction.
So an MT call to manufacturer A, who wants every SIP dialog in it’s own transaction would look something like this:
PCSCF:44738 -> UE:5060; TCP SYN UE:5060 -> PCSCF:44738; TCP SYN/ACK PCSCF:44738 -> UE:5060; TCP ACK --- TCP connection is now open to UE from P-CSCF--- --- Start of new SIP Transaction 1 & Dialog --- PCSCF:44738 -> UE:5060; TCP PSH - SIP INVITE.... UE:5060 -> PCSCF:44738; TCP ACK
--- Start of SIP Transaction 2 --- PCSCF:44738 -> UE:5060; TCP PSH - SIP BYE.... UE:5060 -> PCSCF:44738; TCP ACK, PSH - SIP 200.... --- End of SIP Transaction 2 & SIP Dialog --- PCSCF:44738 -> UE:5060; TCP FIN UE:5060 -> PCSCF:44738; TCP ACK --- End of TCP Connection ---
Where UE:5060 – is the IP & port of the UE, as advertised in the Contact: header, while PCSCF:44738 is the PCSCF IP and a random TCP port used for this connection.
But for manufacturer B, who wants a unique TCP session per transaction, they want it to look like this:
PCSCF:44738 -> UE:5060; TCP SYN UE:5060 -> PCSCF:44738; TCP SYN/ACK PCSCF:44738 -> UE:5060; TCP ACK --- TCP connection is now open to UE from P-CSCF--- --- Start of new SIP Transaction 1 & Dialog --- PCSCF:44738 -> UE:5060; TCP PSH - SIP INVITE.... UE:5060 -> PCSCF:44738; TCP ACK
--- Start of TCP Session 2 ---- PCSCF:32627 -> UE:5060; TCP SYN UE:5060 -> PCSCF:32627; TCP SYN/ACK PCSCF:32627 -> UE:5060; TCP ACK --- Start of SIP Transaction 2 --- PCSCF:32627 -> UE:5060; TCP PSH - SIP BYE.... UE:5060 -> PCSCF:32627; TCP ACK, PSH - SIP 200.... --- End of SIP Transaction 2 & SIP Dialog --- PCSCF:32627 -> UE:5060; TCP FIN UE:5060 -> PCSCF:32627; TCP ACK --- End of TCP Connection 2 ---
And then manufacturer C wants just the one TCP session to be used for everything, so they open the TCP connection when they register, and that’s all we use for everything.
Is there any logic to this? Nope, seems to be tied to the underlying chipset (Qualcomm vs Mediatek vs Unisoc) and the SIP stack used (Qualcomm, MTK, Unisoc, Samsung, Apple).
We’ve profiled devices into one of 3 behaviors, and then we tag them based on user agent as to what “persona” they demand from the network.
I can’t believe I’m still talking about VoLTE / IMS handset support and it’s almost 2025…. For context IMS was “standardized” 17 years ago.
The Data Coding Scheme (DCS or TP-DCS) header in an SMS body indicates what encoding is used in that message.
It means if we’re using UCS-2 (UTF16) special characters like Emojis etc, in in our message, the phone knows to decode the data in the message body using UTF, because the Data Coding Scheme (DCS) header indicates the contents are encoded in UTF.
Likewise, if we’re not using any fancy characters in our message and the message is encoded as plain old GSM7, we set set the DCS to 0 to indicate this is using GSM7.
From my experience, I’d always assumed that DCS0 (Default) == GSM7, but today I learned, that’s not always the case. Some SMSc entities treat DCS0 as Latin.
Let me explain why this is stupid and why I wasted a lot of time on this.
We can indicate that a message is encoded as Latin by setting the DCS to 0x03:
We cannot indicate that the message is encoded as GSM7 through anything other than the default alphabet (DCS 0).
Latin has it’s own encoding flag, if I wanted the message treated as Latin, I’d indicate the message encoding is Latin in the DCS bit!
I spent a bunch of time trying to work out why a customer was having issues getting messages to subscribers on another operator, and it turned out the other operator treats messages we send to them on SMPP with DCS0 as Latin encoding, and then cracks the sads when trying to deliver it.
The above diff shows the message we send (Right), and the message they dry to deliver (left).
I generally do this with Python or via the Swagger UI for the Web UI, but here’s how we can create a fixed-line IMS subscriber in PyHSS, so we can register it with a softphone, without using EAP-AKA.
Firstly we create the AuC object for this password combo.
PyHSS is our open source Home Subscriber Server, it’s written in Python, has a variety of different backends, and is highly perforate (We benchmark to 10K transactions per second) and infinitely scaleable.
In this post I’ll cover the basics of setting up PyHSS in your enviroment and getting some Diameter peers connected.
For starters, we’ll need a database (We’ll use MySQL for this demo) and an account on that database for a MySQL user.
So let’s get that rolling (I’m using Ubuntu 24.04):
sudo apt update sudo apt install mysql-server
Next we’ll create the MySQL user for PyHSS to use:
CREATE USER 'pyhss_user'@'%' IDENTIFIED BY 'pyhss_password';
GRANT ALL PRIVILEGES ON *.* TO 'pyhss_user'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;
We’ll also need Redis as well (PyHSS uses Redis for inter-service communications and for caching), so go ahead an install that for your distro:
sudo apt install redis-server
So that’s our prerequisites sorted, let’s clone the PyHSS repo:
And install the requirements with pip from the PyHSS repo:
pip3 install -r requirements.txt
Next we’ll need to configure PyHSS, for that we update the config file (config.yaml) with the settings we want to use.
We’ll start by setting the bind_ip to a list of IPs you want to listen on, and your transport – We can use either TCP or SCTP.
For Diameter, we will set OriginHost and OriginRealm to match the Diameter hostname you want to use for this peer, and the Realm of your Diameter network.
Lastly we’ll need to set the database parameters, updating the database: section to populate your credentials, setting your username and password and the database to match your SQL installation we setup at the start.
With that done, we can start PyHSS, which we do using systemctl.
Because there’s multiple microservices that make up PyHSS, there’s multiple systemctl files use to run PyHSS as a service, they’re all in the /systemd folder.
S8 Home Routing is a really simple concept, the traffic goes from the SGW in the visited PLMN to the PGW in the home PLMN, so the PCRF, OCS/OFCS, IMS, IP Addresses, etc, etc, are all in the home network, and this avoids huge amounts of complexity.
But in order for this to work, the visited network MME needs to find the PGW of the home network, and with over 700 roaming networks in commercial use, each one with potentially hundreds of unique APNs each routing to a different PGW, this is a tricky proposition.
If you’ve configured your PGW peers statically on your MME, that’s fine, but it doesn’t scale very well – And if you add an MVNO who wants their own PGW for serving their APN, well you’ll be adding some complexity there to, so what to do?
Well, the answer is DNS.
By taking the APN to be served, the home PLMN and the interface type desired, with some funky DNS queries, our MME can determine which PGW should be selected for a request.
Let’s take a look, for a UE from MNC XXX MCC YYY roaming into our network, trying to access the “IMS” APN.
Our MME knows the network code of the roaming subscriber from the IMSI is MNC XXX, MCC YYY, and that the UE is requesting the IMS APN.
So our MME crafts a DNS request for the NAPTR query for ims.apn.epc.mncXXX.mccYYY.3gppnetwork.org:
Because the domain is epc.mncXXX.mccYYY.3gppnetwork.org it’s routed to the authoritative DNS server in the home network, which sends back the response:
We’ve got a few peers to pick from, so we need to filter this list of Answers to only those that are relevant to us.
First we filter by the Service tag, whihc for each listed peer shows what services that peer supports.
But since we’re looking for S8, we need to find a peer who’s “Service” tag string contains:
x-3gpp-pgw:x-s8-gtp
We’re looking for two bits of info here, the presence of x-3gpp-pgw in the Service to indicate that this peer is a PGW and x-s8-gtp to indicate that this peer supports the S8 interface.
A service string like this:
x-3gpp-pgw:x-s5-gtp
Would be excluded as it only supports S5 not S8 (Even though they are largely the same interface, S8 is used in roaming).
It’s also not uncommon to see both services indicated as supported, in which case that peer could be selected too:
x-3gpp-pgw:x-s5-gtp:x-s8-gtp
(The answers in the screenshot include :x-gp which means the PGWs advertised are also co-located with a GGSN)
So with our answers whittled down to only those that meet our needs, we next use the Order and the Preference to pick our best candidate, this is the same as regular DNS selection logic.
From our candidate, we’ve also got the Regex Replacement, which allows our original DNS request to be re-written, which allows us to point at a single peer.
In our answer, we see the original request ims.apn.epc.mncXXX.mccYYY.3gppnetwork.org is to be re-written to topon.lb1.pgw01.epc.mncXXX.mccYYY.3gppnetwork.org.
This is the FQDN of the PGW we should use.
Now we know the FQND we should use, we just do an A-Record lookup (Or AAAA record lookup if it is IPv6) for that peer we are targeting, to turn that FQDN into an IP address we can use.
And then in comes the response:
So now our MME knows the IP of the PGW, it can craft a Create Session request where the F-TEID for the S8 interface has the PGW IP set on it that we selected.
For more info on this TS 129.303 (Domain Name System Procedures) is the definitive doc, but the GSMA’s IR.88 “LTE and EPC Roaming Guidelines” provides a handy reference.
The S8 Home Routing approach for LTE Roaming works really well, as more and more operators are switching off their legacy circuit switched 2G/3G networks and shifting to LTE & VoLTE for roaming, we’re seeing more an more S8-HR deployments.
When LTE was being standardised in 2008, Local Breakout (LBO) and S8 Home Routing were both considered options for how roaming may look. Fast forward to today, and S8 Home routing is the only way roaming is done for modern deployments.
In light of this, there are some “best practices” in an “all S8 Home Routed” world, we’ve developed, that I thought I’d share.
The Basics
When roaming, the SGW in the Visited Network, sends user traffic back to the PGW in the Home Network.
This means Online/Offline charging, IMS, PCRF, etc, is all done in the Home PLMN. As long as data packets can get from the SGW in the Visited PLMN to the PGW in the Home PLMN, and authentication flows from the Visited MME to the HSS in the Home PLMN, you’re golden.
The Constraints
Of course real networks don’t look as simple as this, in reality a roaming scenario for a visited network has a lot more nodes, which need to be
Building Distributed Packet Core & IMS
Virtualization (VNF / CNF) has led operators away from “big iron” hardware for Packet Core & IMS nodes, towards software based solutions, which in turn offer a lot more flexibility.
Best practice for design of User Plane is to keep the the latency down, by bringing the user plane closer to the user (the idea of “Edge” UPFs in 5GC is a great example of this), and the move away from “big iron” in central locations for SGW and PGW nodes has been the trend for the past decade.
So to achieve these goals in the networks we build, we geographically distribute the core network.
This means we’ve got quite a few S-GW, P-GW, MME & HSS instances across the network. There’s some real advantages to this approach:
From a redundancy perspective this allows us to “spread the load” and build far more resilient networks. A network with 20 smaller HSS instances spread around the country, is far more resilient than 2 massive ones, regardless of how many power feeds or redundant disks it may have.
This allows us to be more resource efficient. MNOs have always provisioned excess capacity to cater for the loss of a node. If we have 2 MMEs serving a country, then each node has to have at least 50% capacity free, so if one MME were to fail, the other MME could handle the additional load it from it’s dead friend. This is costly for resources. Having 20 MMEs means each MME has to have 5% capacity free, to handle the loss of one MME in the pool.
It also forces our infrastructure teams to manage infrastructure “as cattle” rather than pets. These boxes don’t get names or lovingly crafted, they’re automatically spun up and destroyed without thinking about it.
For security, we only use internal IP addresses for the nodes in our packet core, this provides another layer of protection for the “crown jewels” of our network, so no one messing with BGP filtering can accidentally open the flood gates to our core, as one US operator learned leaving a GGSN open to the world leading to the private information for 100 million customers being leaked.
What this all adds to, is of course, the end user experience. For the end subscriber / customer, they get a better experience thanks to the reduced latency the connection provides, better uptime and faster call setup / SMS delivery, and less cost to deliver services.
I love this approach and could prothletise about it all day, but in a roaming context this presents some challenges.
The distributed networks we build are in a constant state of flux, new capacity is being provisioned in some areas, nodes things decommissioned in others, and our our core nodes are only reachable on internal IPs, so wouldn’t be reachable by roaming networks.
Our Distributed-Core Roaming Solution
To resolve this we’ve taken a novel approach, we’ve deployed a pair of S-GWs we call the “Roaming SGWs”, and a pair of P-GWs we call the “Roaming PGWs”, these do have public IPs, and are dedicated for use only by roaming traffic.
We really like this approach for a few reasons:
It allows us to be really flexible do what we want inside the network, without impacting roaming customers or operators who use our network for roaming. All the benefits I described from the distributed architectures can still be realised.
From a security standpoint, only these SGW/PGW pairs have public IPs, all the others are on internal IPs. This good for security – Our core network is the ‘crown jewels’ of the network and we only expose an edge to other providers. Even though IPX networks are supposed to be secure, one of the largest IPX providers had their systems breached for 5 years before it was detected, so being almost as distrustful of IPX traffic as Internet traffic is a good thing. This allows us to put these PGWs / SGWs at the “edge” of our network, and keep all our MMEs, as well as our on-net PGW and SGWs, on internal IPs, safe and secure inside our network.
For charging on the SGWs, we only need to worry about collecting CDRs from one set of SGWs (to go into the TAP files we use to bill the other operators), rather than running around hoovering up SGW CDRs from large numbers of Serving Gateways, which may get blown away and replaced without warning.
Of course, there is a latency angle to this, for international roaming, the traffic has to cross the sea / international borders to get to us. By putting it at the edge we’re seeing increased MOS on our calls, as the traffic is as close to the edge of the network as can be.
Caveat: Increased S11 Latency on Core Network sites over Satellite
This is probably not relevant to most operators, but some of our core network sites are fed only by satellite, and the move to this architecture shifted something: Rather than having latency on the S8 interface from the SGW to the PGW due to the satellite hop, we’ve got latency between the MME and the SGW due to the satellite hop.
It just shifts where in the chain the latency lies, but it did lead to us having to boost some timers in the MME and out of sequence deliver detection, on what had always been an internal interface previously.
Evolution to 5G Standalone Roaming
This approach aligns to the Home Routed options for 5G-SA roaming; UPF chaining means that the roaming traffic can still be routed, as seems to be the way the industry is going.
SA roaming is in its infancy, without widely deployed SA networks, we’re not going to see common roaming using SA for a good long while, but I’ll be curious to see if this approach becomes the de facto standard going forward.
Where to from here?
We’re pretty happy with this approach in the networks we’ve been building.
So far it’s made IREG testing easier as we’ve got two fixed points the IPX needs to hit (The DRAs and the SGWs) rather than a wide range of networks.
Operators with a vast number of APNs they need to drop into different VRFs may have to do some traffic engineering here – Our operations are generally pretty flat, but I can see where this may present some challenges for established operators shifting their traffic.
I’d be keen to hear if other operators are taking this approach and if they’ve run into any issues, or any issues others can see in this, feel free to drop a comment below.
Having rated CDRs in CGrateS is great, but in reality, you probably want to get them into a billing system, CSV file, S3 bucket, CRM, invoice, Grafana, SQL table, etc, etc.
The Event Exporter Service (EES (previously called CDRe)) handles exporting CDRs from CGrateS.
Like everything in CGrateS, it’s highly configurable, and, again, like everything in CGrateS, supports every combination of services you can think of, plus a stack you haven’t thought of.
CDRs can be exported one of two ways, in real time, as the CDR is generated (online), or after the fact, exporting from the database containing the CDRs (offline).
Exporting in realtime (online) is a great option if you don’t want (or need) to store the CDRs in CGrateS; if you’re just using CGrateS to rate calls and spit them into a seperate system, this is a fantastic option, as it allows your CGrateS instances to remain light and not get clogged up with lots of old CDRs – That said, of course you can export the CDRs in realtime and still store them in CGrateS, that’s also a totally valid approach as well.
The more traditional approach is offline CDR export, where periodically or when an event is triggered, you scrape up a pile of CDRs and send them to your external systems.
For both options, we’ll need to define at least one exporter in our cgrates.json config file. For this example we’ll define a HTTP POST that we will trigger for realtime (online) CDR exporting, and a CSV file we dump to periodically when called from the API.
So first things first, we enable the EES module in the config:
"ees": {
"enabled": true,
"exporters": [
]
}
We’ll start with defining one exporter, named CSVExporter, that will output files to a folder named “testCSV” in the /tmp/ directory, but you can plonk these files wherever you like:
We’ve got a lot of different types of export available to us, but type *file_csv is the easiest, so that’s where we’ll start.
Setting synchronous to true will mean we’ll only run one export job at a time, but it also means we’ll get back the result via the API, which will allow us to keep track of the ID of the last record we updated, so we don’t export the same record multiple times, more on this later.
Flags allows us to, if we wanted, bounce the event through AttributeS, for example, by adding *attributes to the flags, but in this case, it’s just logging to syslog.
Of course, just enabling ees won’t actually send calls to it, we’ll need to add “ees_conns“: [“*localhost”], to “apiers”: and “cdrs” so they know to bounce the events through it:
If you’ve already got CDRs on your system from our previous tutorial, fantastic, but if not, let’s get up and running with a quick and dirty script to define some destinations, a charger, an account balance and then use some of the balance to generate a CDR:
import cgrateshttpapi
import pprint
import uuid
import datetime
now = datetime.datetime.now()
CGRateS_Obj = cgrateshttpapi.CGRateS('localhost', 2080)
#Define Destinations
CGRateS_Obj.SendData({'method':'ApierV2.SetTPDestination','params':[{"TPid":'cgrates.org',"ID":"Dest_AU_Mobile","Prefixes":["614"]}]})
#Load TariffPlan we just defined from StorDB to DataDB
CGRateS_Obj.SendData({"method":"APIerSv1.LoadTariffPlanFromStorDb","params":[{"TPid":'cgrates.org',"DryRun":False,"Validate":True,"APIOpts":None,"Caching":None}],"id":0})
#Define default Charger
print(CGRateS_Obj.SendData({"method": "APIerSv1.SetChargerProfile","params": [{"Tenant": "cgrates.org","ID": "DEFAULT",'FilterIDs': [],'AttributeIDs' : ['*none'],'Weight': 0,}]}))
account = "Nick_Test_123"
#Add a balance to the account with type *sms with 100 sms events
pprint.pprint(CGRateS_Obj.SendData({"method": "ApierV1.SetBalance","params": [{"Tenant": "cgrates.org","Account": account,"BalanceType": "*sms","DestinationIDs": 'Dest_NZ_Mobile;Dest_AU_Mobile',"Categories": "*any","Balance": {"ID": "100_SMS_Bundle_AU_NZ_Mobile","Value": 100,"Weight": 25}}]}))
#Process CDR Event for a single SMS
pprint.pprint(CGRateS_Obj.SendData({"method": "CDRsV2.ProcessExternalCDR","params": [{"OriginID": str(uuid.uuid1()),"ToR": "*sms","RequestType": "*pseudoprepaid","AnswerTime": now.strftime("%Y-%m-%d %H:%M:%S"),"SetupTime": now.strftime("%Y-%m-%d %H:%M:%S"),"Tenant": "cgrates.org","Account": account,"Destination" : "61412345678","Usage": "1",}]}))
Right, with that out of the way, we should now have something in our CDRs table, a quick SQL query confirms this is the case:
So, as you may have guessed, we’ve called the ExportCDRs API endpoint, we’ve specified which ExporterIDs we want to reference (these link back to the objects in the config, and the one we have defined currently is named CSVExporter).
Setting Verbose: True means that CGrateS gives us back a lot of info from the API call, here’s what we get back:
Now that looks pretty positive, we got 12 events of SMS usage exported, which we can see in the file /tmp/testCSV/CSVExporter_21e9bc2.csv – and if we cat out the file, yeap, there’s all the CDRs.
But it’s a bit of a mess, there’s a lot of fields in there, so let’s adjust what goes into the CSV.
Let’s start by filtering what goes into the exporter, to only give us SMS events, of course you could adjust the filters here to target exporting only the records you want, based on anything you can define with Filters (and there’s a lot you can define with filters).
Now we’re only exporting SMS records, so let’s clean up the output of the CSV to just give us the data we want, which is the CDR ID, time, account, destination and usage.
Now after a restart of CGrateS, our exports look like this:
Stunning, truly beautiful, look at that output!
Right, well you may at this point have noticed a problem if you’ve run this more than once. The problem is that is every time we run this, we get all the CDRs since the beginning of time.
But where filtering by date/time falls down, is that if an offline CDR of a call on Monday, only got ingested on Tuesday, it would be missed by the export.
But, setting Verbose: True on the ExportCDRs API call gives us a handy trick, we’ve been told what the highest ID in the CDRs table we just exported in the response from the API in LastExpOrderID field.
If we jump over to the SQL database we use for StorDB, we can see that 33 is the ID of the highest CDR in the system.
So let’s try something, let’s run the exporter again, but this time let’s get all the CDRs where the ID is higher than 33:
#Process CDR Event for a single SMS
pprint.pprint(CGRateS_Obj.SendData({"method": "CDRsV2.ProcessExternalCDR","params": [{"OriginID": str(uuid.uuid1()),"ToR": "*sms","RequestType": "*pseudoprepaid","AnswerTime": now.strftime("%Y-%m-%d %H:%M:%S"),"SetupTime": now.strftime("%Y-%m-%d %H:%M:%S"),"Tenant": "cgrates.org","Account": account,"Destination" : "61412345678","Usage": "1",}]}))
#Trigger export where the OrderID is above 33
result = CGRateS_Obj.SendData({"method":"APIerSv1.ExportCDRs","params":[
{"ExporterIDs": ["CSVExporter"],
"Verbose" : True,
"ExtraArgs" : {
"OrderIDStart" : int(33),
},
"Accounts" : [account]}
]})
pprint.pprint(result)
Boom, now if we have a look at the output we can see the export covered two records, and the last ID was 35.
So as long as we keep track of the LastExpOrderID value, and feed that as in input every time we run ExportCDRs, we can ensure we never miss a CDR, and never get the same CDR twice.
I got an email the other day asking a simple question:
How do I know if a subscriber is VoLTE roaming or not when they send an SMS to charge for it?
My immediate reaction was to look at the SIP headers, P-Access-Network-Info will tell you where the subscriber is located, end of.
Right?
Well not quite, this will tell the SMSc the location of the subscriber sending the SMS. If the PLMN in the P-Access-Network-Info != the home PLMN, the sub is roaming.
But does this information get passed to the OCS / OFCS?
The SMSc uses “Event based charging” to perform credit control, so let’s have a look at what AVPs are present in the Credit Control Request from the SMSc:
Hmm, the SMS-Information AVP (2000) contains a bunch of information about the SMS being sent, but I don’t see anything about the location of the sender in there.
Originator-Interface is just set to “SIP”, of course in a 2G/3G roaming scenario the Originator-SCCP-Address would be that of the Visited PLMN, but for us it is our SCCP address.
Maybe the standard allows for an additional optional AVP in the SMS-Information-AVP we’re missing? Let’s check TS 32.299:
Nope.
So how to deal with this?
While the standards aren’t totally clear on this, we added an IMS-Info AVP and inside that populated the Access-Network-Information directly from the SIP header, and then picked that off inside our OCS in order to apply the correct rules.
If you’re used to dealing with SIP, you’d expect to see:
Content-Type: application/sdp
This Content-Type multipart/mixed;boundary is totally valid in SIP, in fact RFC 5261 (Message Body Handling in the Session Initiation Protocol (SIP)) details the use of MIME in SIP, and the Geolocation extension uses this, as we see below from a 911 call example.
But while this extension is standardised, and having your SIP Body containing multipart MIME is legal, not everything supports this, including the FreeSWITCH bridge module, which just appends a new SDP body into the Mime Multipart
Site note: I noticed FreeSWITCH Bridge function just appends the new SIP body in the multipart MIME, leaving the original, SDP:
Okay, so how do we replace the MIME Multipart SIP body with a standard SDP?
#If the body is multipart then strip it and replace with a single part
if (has_body("multipart/mixed")) {
xlog("This has a multipart body");
if (filter_body("application/sdp")) {
remove_hf("Content-Type");
append_hf("Content-Type: application/sdp\r\n");
} else {
xlog("Body part application/sdp not found\n");
}
}
Android, being open source, allows us to see how this logic works, and it’s important for operators to understand this logic, as it’s what dictates the behavior in many scenarios.
It’s important to note that I’m not covering Apple here, this information is not publicly available to share for iOS devices, so I won’t be sharing anything on this – Apple has their own ecosystem to handle emergency calling, if you’re from an operator and reading this, I’d suggest getting in touch with your Apple account manager to discuss it, they’re always great to work with.
The Android Open Source Project has an “emergency number database”. This database has each of the emergency phone numbers and the corresponding service, for each country.
This file can be read at packages/services/Telephony/ecc/input/eccdata.txt on a phone with engineering mode.
Let’s take a look what’s in mainline Android for Australia: