TADIG | Nick vs Networking

The S8 Home Routing approach for LTE Roaming works really well, as more and more operators are switching off their legacy circuit switched 2G/3G networks and shifting to LTE & VoLTE for roaming, we’re seeing more an more S8-HR deployments.

When LTE was being standardised in 2008, Local Breakout (LBO) and S8 Home Routing were both considered options for how roaming may look. Fast forward to today, and S8 Home routing is the only way roaming is done for modern deployments.

In light of this, there are some “best practices” in an “all S8 Home Routed” world, we’ve developed, that I thought I’d share.

The Basics

When roaming, the SGW in the Visited Network, sends user traffic back to the PGW in the Home Network.

This means Online/Offline charging, IMS, PCRF, etc, is all done in the Home PLMN. As long as data packets can get from the SGW in the Visited PLMN to the PGW in the Home PLMN, and authentication flows from the Visited MME to the HSS in the Home PLMN, you’re golden.

The Constraints

Of course real networks don’t look as simple as this, in reality a roaming scenario for a visited network has a lot more nodes, which need to be

Building Distributed Packet Core & IMS

Virtualization (VNF / CNF) has led operators away from “big iron” hardware for Packet Core & IMS nodes, towards software based solutions, which in turn offer a lot more flexibility.

Best practice for design of User Plane is to keep the the latency down, by bringing the user plane closer to the user (the idea of “Edge” UPFs in 5GC is a great example of this), and the move away from “big iron” in central locations for SGW and PGW nodes has been the trend for the past decade.

So to achieve these goals in the networks we build, we geographically distribute the core network.

This means we’ve got quite a few S-GW, P-GW, MME & HSS instances across the network.
There’s some real advantages to this approach:

From a redundancy perspective this allows us to “spread the load” and build far more resilient networks. A network with 20 smaller HSS instances spread around the country, is far more resilient than 2 massive ones, regardless of how many power feeds or redundant disks it may have.

This allows us to be more resource efficient. MNOs have always provisioned excess capacity to cater for the loss of a node. If we have 2 MMEs serving a country, then each node has to have at least 50% capacity free, so if one MME were to fail, the other MME could handle the additional load it from it’s dead friend. This is costly for resources. Having 20 MMEs means each MME has to have 5% capacity free, to handle the loss of one MME in the pool.

It also forces our infrastructure teams to manage infrastructure “as cattle” rather than pets. These boxes don’t get names or lovingly crafted, they’re automatically spun up and destroyed without thinking about it.

For security, we only use internal IP addresses for the nodes in our packet core, this provides another layer of protection for the “crown jewels” of our network, so no one messing with BGP filtering can accidentally open the flood gates to our core, as one US operator learned leaving a GGSN open to the world leading to the private information for 100 million customers being leaked.

What this all adds to, is of course, the end user experience.
For the end subscriber / customer, they get a better experience thanks to the reduced latency the connection provides, better uptime and faster call setup / SMS delivery, and less cost to deliver services.

I love this approach and could prothletise about it all day, but in a roaming context this presents some challenges.

The distributed networks we build are in a constant state of flux, new capacity is being provisioned in some areas, nodes things decommissioned in others, and our our core nodes are only reachable on internal IPs, so wouldn’t be reachable by roaming networks.

Our Distributed-Core Roaming Solution

To resolve this we’ve taken a novel approach, we’ve deployed a pair of S-GWs we call the “Roaming SGWs”, and a pair of P-GWs we call the “Roaming PGWs”, these do have public IPs, and are dedicated for use only by roaming traffic.

We really like this approach for a few reasons:

It allows us to be really flexible do what we want inside the network, without impacting roaming customers or operators who use our network for roaming. All the benefits I described from the distributed architectures can still be realised.

From a security standpoint, only these SGW/PGW pairs have public IPs, all the others are on internal IPs. This good for security – Our core network is the ‘crown jewels’ of the network and we only expose an edge to other providers. Even though IPX networks are supposed to be secure, one of the largest IPX providers had their systems breached for 5 years before it was detected, so being almost as distrustful of IPX traffic as Internet traffic is a good thing.
This allows us to put these PGWs / SGWs at the “edge” of our network, and keep all our MMEs, as well as our on-net PGW and SGWs, on internal IPs, safe and secure inside our network.

For charging on the SGWs, we only need to worry about collecting CDRs from one set of SGWs (to go into the TAP files we use to bill the other operators), rather than running around hoovering up SGW CDRs from large numbers of Serving Gateways, which may get blown away and replaced without warning.

Of course, there is a latency angle to this, for international roaming, the traffic has to cross the sea / international borders to get to us. By putting it at the edge we’re seeing increased MOS on our calls, as the traffic is as close to the edge of the network as can be.

Caveat: Increased S11 Latency on Core Network sites over Satellite

This is probably not relevant to most operators, but some of our core network sites are fed only by satellite, and the move to this architecture shifted something: Rather than having latency on the S8 interface from the SGW to the PGW due to the satellite hop, we’ve got latency between the MME and the SGW due to the satellite hop.

It just shifts where in the chain the latency lies, but it did lead to us having to boost some timers in the MME and out of sequence deliver detection, on what had always been an internal interface previously.

Evolution to 5G Standalone Roaming

This approach aligns to the Home Routed options for 5G-SA roaming; UPF chaining means that the roaming traffic can still be routed, as seems to be the way the industry is going.

SA roaming is in its infancy, without widely deployed SA networks, we’re not going to see common roaming using SA for a good long while, but I’ll be curious to see if this approach becomes the de facto standard going forward.

Where to from here?

We’re pretty happy with this approach in the networks we’ve been building.

So far it’s made IREG testing easier as we’ve got two fixed points the IPX needs to hit (The DRAs and the SGWs) rather than a wide range of networks.

Operators with a vast number of APNs they need to drop into different VRFs may have to do some traffic engineering here – Our operations are generally pretty flat, but I can see where this may present some challenges for established operators shifting their traffic.

I’d be keen to hear if other operators are taking this approach and if they’ve run into any issues, or any issues others can see in this, feel free to drop a comment below.

If you’ve ever worked in roaming, you’ll probably have had the misfortune of dealing with Transferred Account Procedures aka TAP files.

It’s used for billing a 2G GSM call right up to 5G data usage, if you use a service while roaming, somewhere in the world there’s a TAP file with your usage in it.

A brief history of TAP

TAP was originally specified by the GSMA in 1991 as a standard CDR interchange format between operators, for use in roaming scenarios.

Notice I said GSMA – Not 3GPP – This means there’s no 3GPP TS docs for this, it’s defined by the industry lobby group’s members, rather than the standards body.

So what does this actually mean? Well, if you’re MNO A and a customer from MNO B roams into your network, all the calls, SMS and data consumed by the roaming subscriber from MNO B will need to be billed to MNO B, by you, MNO A.

If a network operator wants to get paid for traffic used on their network by roaming subscribers, they’d better send out a TAP file to the roamer’s home network.

TAP is the file format generated my MNO A and sent to MNO B, containing all the usage charges that subscribers from MNO B have racked up while roaming into your network.

These are broken down into “Transactions” (CDRs), for events like making a call, connecting a PDN session and consuming data, or sending a text.

In the beginning of time, GSM provided only voice calling service. This meant that the only services a subscriber could consume while roaming was just making/receiving voice calls which were billed at the end of each month. – This meant billing was equally simple, every so often the visisted network would send the TAP files for the voice calls made by subscribers visited other networks, to the home networks, which would markup those charges, and add them onto the monthly invoice for each subscriber who was roaming.

But of course today, calling accounts for a tiny amount of usage on the network, but this happened gradually while passing through the introduction of SMS, CAMEL services, prepaid services, mobile data, etc.
For all these services that could be offered, the TAP format had to evolve to handle each of these scenarios.

As we move towards a flat IP architecture, where voice calls and SMS sent while roaming are just data, TAP files for 4G and 5G networks only need to show data transactions, so the call objects, CAMEL parameters and SMS objects are all falling by the wayside.

What’s inside a TAP File

TAP uses the most beloved of formats – ASN1 to encode the data. This means it is strictly formatted and rigidly specified.

Each file contains a Sequence Number which is a monotonically increasing number, which allows the receiver to know if any files have been missed between the file that’s being currently parsed, an the previous file.

They also have a recipient and sender TADIG code, which is a code allocated by GSMA that uniquely identifies the sender and the recipient of the file.

The TAP records exist in one of two common format, Notification Records and transferBatch records.

These files are exchanged between operators, in practice this means “Dumped on an FTP server as agreed between the two”.

TAP Notification Records

Notifications are the simplest of TAP records and are used when there aren’t any CDRs for roaming events during the time period the TAP file covers.

These are essentially blank TAP files generated by the visited network to let the home network know it’s still there, but there are no roaming subs consuming services in that period.

Notification files are really simple, let’s take a look as one shown as JSON:

{
  "notification": {
    "fileAvailableTimeStamp": {
      "localTimeStamp": "2023/03/21  15:44:18", 
      "utcTimeOffset": "-0900"
    }, 
    "fileCreationTimeStamp": {
      "localTimeStamp": "2023/03/21  15:44:18", 
      "utcTimeOffset": "-0900"
    }, 
    "fileSequenceNumber": "00003", 
    "recipient": "RECV", 
    "releaseVersionNumber": 12, 
    "sender": "SEND", 
    "specificationVersionNumber": 3, 
    "transferCutOffTimeStamp": {
      "localTimeStamp": "2023/03/21  15:44:18", 
      "utcTimeOffset": "-0900"
    }
  }
}

Pretty simple!

TAP transferBatch Records

When we have services to bill and records to charge, that’s when instead we generate a transferBatch record.

It looks something like this:

There’s a lot going on in here, so let’s break it down section by section.

accountingInfo

The accountingInfo section specifies the currency, exchange rate parameters.

Keep in mind a TAP record generated by an operator in the US, would use USD, while the receiver of the file may be a European MNO dealing in EUR.

This gets even more complicated if you’re dealing with more obscure currencies where an intermediary currency is used, that’s where we bring in SDRs (“Special Drawing Right”) that map to the dollar value to be charged, kinda – the roaming agreement defines how many SDRs are in a dollar, in the example below we’re not using any, but you do see it.

When it comes to numbers and decimal places, TAP doesn’t exactly make it easy.

Significant Digits are defined by counting the first number before the decimal point and all the numbers to the right of the decimal point, so for example the number 1.234 would be 4 significant digits (1 digit before the decimal point and 3 digits after it).

Decimal Places are not actually supported for the Value fields in the TAP file. This is tricky because especially today when roaming tariffs are quite low, these values can be quite small, and we need to represent them as an integer number. TAP defines decimal places as the number of digits after the decimal place.

When it comes to the maximum number of decimal places, this actually impacts the maximum number we can store in the field – as ASN1 strictly enforce what we put in it.

"accountingInfo": {
  "currencyConversionInfo": [
    {
      "exchangeRate": 1,
      "exchangeRateCode": 1,
      "numberOfDecimalPlaces": 0
    }
  ],
  "localCurrency": "USD",
  "tapCurrency": "USD",
  "tapDecimalPlaces": 5
},

auditControlInfo

The auditControlInfo section contains the number of CDRs (callEventDetailsCount) contained in the TAP file, the timestamp of the first and last CDR in the file, the total charge and any tax charged.

All of the currency information was provided in the accountingInfo so this is just giving us our totals.

"auditControlInfo": {
  "callEventDetailsCount": 6,
  "earliestCallTimeStamp": {
    "localTimeStamp": "2023/01/24  18:53:35",
    "utcTimeOffset": "-0900"
  },
  "latestCallTimeStamp": {
    "localTimeStamp": "2023/02/14  20:46:11",
    "utcTimeOffset": "-0900"
  },
  "totalCharge": 17920,
  "totalDiscountValue": 0,
  "totalTaxValue": 0
},

A CDR has 30 days from the time it was generated / service consumed by the roamer, to be baked into a TAP file. After this we can no longer charge for it, so it’s important that the earliestCallTimeStamp is not more than 30 days before the fileCreationTimeStamp seen in batchControlInfo.

batchControlInfo

The batchControlInfo section specifies the time the TAP file became available for transfer, the time the file was created (usually the same), the sequence number and the sender / recipient TADIG codes.

As mentioned earlier, we track sequence number so the receiver can know if a TAP file has been missed; for example if you’ve got TAP file 1 and TAP file 3 comes in, you can determine you’ve missed TAP file 2.

"batchControlInfo": {
  "fileAvailableTimeStamp": {
    "localTimeStamp": "2023/03/22  20:14:42",
    "utcTimeOffset": "-0900"
  },
  "fileCreationTimeStamp": {
    "localTimeStamp": "2023/03/22  20:14:42",
    "utcTimeOffset": "-0900"
  },
  "fileSequenceNumber": "00001",
  "recipient": "RECV",
  "releaseVersionNumber": 12,
  "sender": "SEND",
  "specificationVersionNumber": 3,
  "transferCutOffTimeStamp": {
    "localTimeStamp": "2023/03/22  20:14:42",
    "utcTimeOffset": "-0900"
  }
},

callEventDetails / gprsCall

Now we’re getting to the meat & potatoes of our TAP record, the CDRs themselves.

In LTE networks these are just records of data consumption, so let’s take a look inside the gprsCall records under callEventDetails:

In the gprsBasicCallInformation we’ve got as the name suggests the basic info about the data usage event. The time when the session started, the charging ID, the IMSI and the MSISDN of the subscriber to charge, along with their IP and the APN used.

"callEventDetails" { "gprsCall": {
          "gprsBasicCallInformation": {
            "callEventStartTimeStamp": {
              "localTimeStamp": "2023/01/24  18:53:56",
              "utcTimeOffsetCode": 1
            },
            "chargingId": 18290321,
            "gprsChargeableSubscriber": {
              "chargeableSubscriber": {
                "simChargeableSubscriber": {
                  "imsi": "001010000000002",
                  "msisdn": "142151232"
                }
              },
              "pdpAddress": "f21a:b5b9:a0b1:e568:a531:91f5:dc5e:899e"
            },
            "gprsDestination": {
              "accessPointNameNI": "internet",
              "accessPointNameOI": "mnc001.mcc001.gprs"
            },
            "totalCallEventDuration": 84
          },

Next up we have the gprsLocationInformation – rates and tariffs may be set based on the location of the subscriber, so we need to identify the area the sub was using the services to select correct tariff / rate for traffic in this destination.

The recEntity is the index number of the SGW / PGW used for the transaction (more on that later).

"callEventDetails" {  "gprsCall": {
          "gprsLocationInformation": {
            "geographicalLocation": {
              "servingBid": "0002",
              "servingLocationDescription": "Nick Lab AU"
            },
            "gprsNetworkLocation": {
              "cellId": 00002,
              "locationArea": 100,
              "recEntity": [
                4,
                2
              ]
            }
          },

Next we have the gprsServiceUsed which, again as the name suggests, details the services used and the charge.

chargeDetailList contains the charged data (Made up of dataVolumeIncoming + dataVolumeOutgoing) and the cost.

The chargeableUnits indicates the actual data consumed, however most roaming agreements will standardise on some level of rounding, for example rounding up to the nearest Kilobyte (1024 bytes), so while a sub may consume 1025 bytes of data, they’d be billed for 2045 bytes of data. The data consumed is indicated in the chargeableUnits which indicates how much data was actually consumed, before any rounding policies where applied, while the amount that is actually charged (When taking into account rounding policies) is indicated inside Charged Units.

In the example below data usage is rounded up to the nearest 1024 bytes, 134390 bytes rounds up to the nearest 1024 gives you 135168 bytes.

As this is data we’re talking bytes, but not all bytes are created equal!

VoLTE traffic, using a QCI1 bearer is more valuable than QCI 9 cat videos, and TAP records take this into account in the Call Type Groups, each of which has a different price – Call Type Level 1 indicates the type of traffic, for S8 Home Routed LTE Traffic this is 10 (HGGSN/HP-GW), while Call Type Level 2 indicates the type of traffic as mapped to QCI values:

20 Unspecified/default LTE QCIs
21 LTE QCI 1 Conversational
22 LTE QCI 2 Conversational
23 LTE QCI 3 Conversational
24 LTE QCI 4 Streaming
25 LTE QCI 5 Interactive (specialised for signalling)
26 LTE QCI 6 Interactive
27 LTE QCI 7 Interactive
28 LTE QCI 8 Interactive
29 LTE QCI 9 Background

So Call Type Level 2 set to 20 indicates that this is “20 Unspecified/default LTE QCIs”, and Call Type Level 3 can be set to any value based on a defined inter-operator tariff.

Here’s the example as JSON:

"callEventDetails" {  "gprsCall": {
          "gprsServiceUsed": {
            "chargeInformationList": [
              {
                "callTypeGroup": {
                  "callTypeLevel1": 10,
                  "callTypeLevel2": 20,
                  "callTypeLevel3": 0
                },
                "chargeDetailList": [
                  {
                    "charge": 1320,
                    "chargeType": "3030",
                    "chargeableUnits": 134390,
                    "chargedUnits": 135168
                  }
                ],
                "chargedItem": "58",
                "exchangeRateCode": 1
              }
            ],
            "dataVolumeIncoming": 65536,
            "dataVolumeOutgoing": 69632
          },
          "operatorSpecInformation": [
            "RAT:6",
          ]
        }
      }

These gprsCall objects are added for each CDR in the TAP file (as indicated by callEventDetailsCount).

networkInfo

Lastly we have the networkInfo which contains the recEntityInfo which gives each network element used in the transaction an ID that can be referenced.

    "networkInfo": {
      "recEntityInfo": [
        {
          "recEntityCode": 1,
          "recEntityId": "1.2.3.4",
          "recEntityType": 7
        },
        {
          "recEntityCode": 2,
          "recEntityId": "1.2.3.5",
          "recEntityType": 8
        },
      ],
      "utcTimeOffsetInfo": [
        {
          "utcTimeOffset": "-0900",
          "utcTimeOffsetCode": 1
        }
      ]
    }

recEntityType 7 means a PGW and contains the IP of the PGW in the Home PLMN, while recEntityType 8 means SGW and is the SGW in the Visited PLMN.

So this means if we reference recEntityCode 2 in a gprsCall, that we’re referring to an SGW at 1.2.3.5.

Lastly also got the utcTimeOffsetInfo to indicate the timezones used and assign a unique code to it.

Using the Records

We as humans? These records aren’t meant for us.

They’re designed to be generated by the Visited PLMN and sent to to the home PLMN, which ingests it and pays the amount specified in the time agreed.

Generally this is an FTP server that the TAP records get dumped into, and an automated bank transfer job based on the totals for the TAP records.

Testing of the TAP records is called “TADIG Testing” and it’s something we’ll go into another day, but in essence it’s validating that the output and contents of the files meet what both operators think is the contract pricing and specifications.

So that’s it! That’s what’s in a TAP record, what it does and how we use it!

GSMA are introducing BCE – Billing & Charging Evolution, a new standard, designed to last for the next 30+ years like TAP has. It’s still in its early days, but that’s the direction the GSMA has indicated it would like to go.

Nick vs Networking

Telco Network Engineering

Tag Archives: TADIG

Best Practices for SGW & PGW Deployment Architectures for Roaming