I was diffing two PCAPs the other day trying to work out what was going on, and noticed the Instance ID on a GTPv2 IE was different between the working and failing examples.
If more than one grouped information elements of the same type, but for a different purpose are sent with a message, these IEs shall have different Instance values.
So if we’ve got two IEs of the same IE type (As we often do; F-TEIDs with IE Type 87 may have multiple instances in the same message each with different F-TEID interface types), then we differentiate between them by Instance ID.
The only exception to this rule is where the data is identical: if the same IE, with the exact same values and purpose, appears twice inside the message.
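To make that concrete, here's a rough sketch (in Python, not from any real stack) of how a receiver walks the IEs in a GTPv2 message body and uses the Instance nibble to tell two IEs of the same type apart. The 4-byte IE header layout is per TS 29.274, but the message body below is synthetic:

import struct

def parse_gtpv2_ies(body):
    # Each IE header is 4 bytes: Type (1 byte), Length (2 bytes),
    # then a Spare/Instance byte where the low nibble is the Instance ID
    offset = 0
    while offset + 4 <= len(body):
        ie_type, length, spare_instance = struct.unpack_from("!BHB", body, offset)
        instance = spare_instance & 0x0F
        yield ie_type, instance, body[offset + 4: offset + 4 + length]
        offset += 4 + length

# Synthetic body holding two F-TEID IEs (Type 87), Instances 0 and 1,
# as you'd see in a Create Session Request (Sender F-TEID vs PGW S5/S8 F-TEID)
body = bytes([87, 0, 2, 0x00, 0xAA, 0xBB,
              87, 0, 2, 0x01, 0xCC, 0xDD])
for ie_type, instance, value in parse_gtpv2_ies(body):
    print(f"IE Type {ie_type}, Instance {instance}, Value {value.hex()}")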
Last year we deployed some Hughes HL1120W OneWeb terminals in one of the remote cellular networks we support.
Unfortunately it was failing to meet our expectations in terms of performance and reliability – We were seeing multiple dropouts every few hours, for between 30 seconds and ~3 minutes at a time, and while our reseller was great, we weren’t really getting anywhere with Eutelsat in terms of understanding why it wasn’t working.
Luckily for us, Hughes (who manufacture the OneWeb terminals) have an unprotected API (*facepalm*) from which we can scrape all the information about what the terminal sees.
As that data is in an API we have to query, I knocked up a quick Python script to poll the API and convert the data from the API into Prometheus data so we could put it into Grafana and visualise what’s going on with the terminals and the constellation.
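The script itself isn't much more than a loop. Here's a minimal sketch of the approach using requests and prometheus_client – note the endpoint path and JSON field names below are illustrative placeholders, not the actual Hughes API schema:

import time
import requests
from prometheus_client import Gauge, start_http_server

# Hypothetical metrics - pick whatever the terminal actually reports
SNR = Gauge("oneweb_terminal_snr_db", "Reported signal-to-noise ratio")
STATE = Gauge("oneweb_terminal_online", "1 if the terminal reports itself online")

def poll(terminal_ip):
    # Hypothetical path - substitute whatever the terminal actually exposes
    data = requests.get(f"http://{terminal_ip}/api/status", timeout=5).json()
    SNR.set(data.get("snr", 0))
    STATE.set(1 if data.get("state") == "online" else 0)

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes us here
    while True:
        poll("192.168.100.1")
        time.sleep(15)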
After getting all this into Grafana and combining it with the ICMP Blackbox exporter (we configured Blackbox to send HTTP requests and ICMP pings out of each of the different satellite terminals we had – a mix of OneWeb and others), we could see a pattern emerging where certain “birds” (satellites) that passed overhead would come with packet loss and dropouts.
It was the same satellites each time that led to the drops, which allowed us to pinpoint the problem: when we see this satellite coming over the horizon, we know there’s going to be some packet loss.
In the end Eutelsat acknowledged they had two faulty satellites in the orbit we are using, hence seeing the dropouts, and they are currently working on resolving this (but that actually does require rockets, so we’re left without a usable service for the time being) but it was a fun problem to diagnose and a good chance to learn more about space.
Packet loss on the two OneWeb terminals (not seen on the other constellation) correlated with a given satellite pass
The repo has instructions for use and the Grafana templates we used.
At one point I started playing with the OneWeb Ephemeris data so I could calculate the azimuth and elevation of each of the birds from our relative position, and work out distances and angles from the terminal. The maths was kinda fun, but oddly the datetimes in the OneWeb ephemeris data set seem to be about 10 years and 10 days behind the current datetime – possibly this gives an insight into OneWeb’s two day outage at the start of the year, due to their software not handling leap years.
Despite all these teething issues I’m still optimistic about OneWeb, Kuiper and Qianfan (Thousand Sails) opening up the LEO market and covering more people in more places.
Update: Thanks to Scott via email who sent this: One note, there’s a difference between GPS time and Unix time of about 10 years 5 days. This is due to a) the Unix epoch starting 1970-01-01 and the GPS epoch starting 1980-01-06, and b) GPS time is not adjusted for leap seconds, and ends up being offset by an integer number of seconds.
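A quick sanity check of those numbers (Unix epoch 1970-01-01, GPS epoch 1980-01-06) with Python’s datetime:

from datetime import datetime, timezone

unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
gps_epoch = datetime(1980, 1, 6, tzinfo=timezone.utc)

delta = gps_epoch - unix_epoch
print(delta.days)  # 3657 days = 10 years (incl. leap days 1972 and 1976) + 5 days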
One of the really neat features about using automated RF planning tools like Forsk Atoll is you’re able to get it to automatically try out tweaks and look at how that impacts performance.
In the past you’d adjust something, run the simulation again, look at the results and compare to what you had before.
Atoll’s ACP (Automatic Cell Planning) module allows you to automate this, and in most cases, it does a better job than I would!
Today we’ll look at Cell Site Selection in Atoll.
To begin with we’ll limit the computation area down to a polygon we draw around the area in question.
In the Geo tab we’ll select Zones -> Computation Zone and select Edit
We’ll create a new Polygon and draw around the area we are going to analyze. You can automate this step based on population levels, etc, if you’ve got that data present.
So now we’ve set our computation area to the selection, but if we didn’t do this, we’d be computing for the whole world, and that might take a while…
Generating Candidate Sites
Atoll sucks at this – I’ve found that if your computation zone is set and it’s not a rectangle, bad things happen – so I’ve written a little script to generate candidates for me.
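For reference, here’s a cut-down sketch of the kind of script I mean: it drops a candidate pin every N metres inside a polygon and writes a CSV that the import step later can ingest. It assumes shapely is installed, and the polygon coordinates are placeholders:

import csv
from shapely.geometry import Point, Polygon

def generate_candidates(polygon, step_m=500):
    # Crude degrees-per-metre conversion; fine for small areas -
    # use a projected CRS (e.g. via pyproj) for anything serious
    step = step_m / 111_320
    min_x, min_y, max_x, max_y = polygon.bounds
    y = min_y
    while y <= max_y:
        x = min_x
        while x <= max_x:
            if polygon.contains(Point(x, y)):
                yield (x, y)
            x += step
        y += step

# Placeholder polygon roughly around Adelaide
area = Polygon([(138.5, -34.9), (138.7, -34.9), (138.7, -35.1), (138.5, -35.1)])
with open("candidates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "longitude", "latitude"])
    for i, (lon, lat) in enumerate(generate_candidates(area)):
        writer.writerow([f"CAND_{i:04d}", lon, lat])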
Creating a new ACP Job
From the Network tab, right click on ACP Automatic Cell Planning and select New
Optimization Tab
Before we can define all the specifics of what we’re looking to plan / improve, we need to set some limits on the software itself and tell it what we’re looking to improve.
The resolution defines how precise the results should be, and the iterations define how many changes the software should run through.
The higher the number of iterations, the better the results, but it’s not linear – the improvement between 1,000 iterations and 1,000,000,000 iterations is typically pretty minor. This is because ACP works on a kind of “getting warmer” philosophy: it changes a value up or down, looks at the overall result, and if the result was better, changes the value again until it stops getting better.
As I’m working in a fairly small area I’m going to set 100 iterations and a 50m resolution.
In the optimization tab we can also set constraints. For example, we’re looking at where to place cell sites in an area, and as far as Atoll is concerned, if we just throw hundreds of sites at an area we’ll have pretty good results – but the economics of that doesn’t work. So we can set constraints; for site selection we may want to set the max number of cell sites. As we are importing ~5k candidate locations, we probably don’t want to build 5k cell sites 20m apart, so set this to be a reasonable number for your geography.
When using ACP for Optimization, as we’ll see later on, we can also set cost constraints regarding the cost to make changes, but for now this is just going to pick the best cell site locations for us.
Objectives Tab
Next up we’ll need to set up Automatic Cell Planning’s objectives.
For ACP to be an effective tool we need to define what we’re looking for in terms of success – you can’t just throw it some values and say “Make it better”; we need to define what parameters we’re looking to improve. We do this by setting Objectives.
Your objectives are going to be based on your needs and wants, but for this example we’re building a greenfield network, so we want to offer coverage over an area, as well as good RSRP and RSRQ. We’ll set the objectives to Coverage of 95% of the Computation Zone for this post, with a secondary objective of increasing RSRP and RSRQ.
But today I’m modeling for coverage, so let’s set that:
As we’re planning for LTE we need to set the UE parameters; since I’m planning for a mobile network, I’ll need to set the service type and terminal.
Reconfiguration
Now we’ve defined the Objectives, it’s time to define what values ACP can mess with to try and achieve them. For some ACP runs you may be adjusting tilts or azimuths, swapping out antennas, etc., but today we’re looking for where we can put cell sites to most effectively serve our target area.
Now we import our candidate list. This might be a list of potential towers you can use, or in my case, for something greenfield, I’m just importing a list of points on a map every X meters to find the best locations to place towers.
From the “Reconfiguration” tab, we’ll select “Setup” to add the sites we want to evaluate.
Atoll has “Automatic Candidate Positioning” which allows it to generate pins on the map, but I’ve not had any luck with it. Instead I’m importing a list of candidates I’ve generated via a little Python script (like the sketch earlier), so I’ll select “Import from File”.
Pick my file and set the parameters for importing the data like so.
Now we’ve got candidates for cell sites defined, we set the station template to populate and then we’re good to go.
Running ACP
Once you’ve tweaked all your ACP values as required, we can run the ACP job.
As ACP runs you’ll see a graph showing the objectives and the levels it needs to reach to satisfy them. This step can take a super duper long time – especially if your computation zone is large or your number of candidates is high.
But eventually – when we’re a lot older and wearier – ACP will have completed, and we can check out the Optimization it’s created.
In my case the objectives failed to be met, but that’s OK for me.
Once it’s completed, the Changes tab outlines the recommended changes, and the Objectives tab outlines how this has performed against the criteria we outlined at the start. If we’re happy with the result, we can Commit the changes to put them on the map from the Commit tab.
With that done I weed out the sites in impractical locations, like the ones in the sea…
Now we’ve got the sites plugged in, the next thing we’ll start doing is optimizing them.
When we’re dealing with greenfield builds like we are today, the “Move to highest location within X Meters” function is super useful. If you’ve got a high point on a property, we want to build our tower on it, and this function moves the tower to that highest point.
One thing to note is this just plans our grid. It won’t adjust azimuths, downtilts, etc, in one operation. We need to use another ACP operation to achieve that, and that’s the content of a different post!
Had an interesting fault come across my desk the other day; calls were failing when the called party (an SSP we talk to via SS7/ISUP) had an exchange based call forward in place.
We’re a SIP based network, but we do talk some SS7/ISUP on the edges, and it was important that we handled this correctly.
I could see in the Address Complete Message (ACM) sent back to our network that there was redirection information here:
We would see the B party SSP release the call as soon as it sent this.
This made me wonder if we, as the originating network, were supposed to redirect to the new B party and send a new Initial Address Message?
After a lot of digging in the ITU Q.7xx docs (I’m nowhere near as fast at finding information in specs written prior to my birth as I am with the 3GPP specs) I found my answer – these headers are informational only; the B party SSP is meant to re-target the message, and send us an Alerting or Answer message when it’s done so.
Recently one of our customers who’s got a large number of Calix E7 ONTs needed some help to automate some of the network management tasks to do with the CPEs.
We’d set up a TR-069 Auto Configuration Server (ACS) for the Calix RGs (the modems) so that we could manage the config parameters on the devices.
Setup was surprisingly easy: after installing some god-awful ’90s Java stuff to access Calix’s “CMS”, we pointed everything at our ACS (per screenshot below) and presto – a few thousand CPEs were there ready to be centrally managed.
CAMEL is primarily focused on charging for Voice & SMS services, as data generally uses Diameter, so it’s voice and SMS we’ll focus on.
CAMEL is spoken between the MSC (gsmSSF) and the OCS (gsmSCF).
Basic Call State Model
CAMEL is closely related to the Intelligent Network stuff of the 1980s, and steals a lot of its ideas from there. Unfortunately, reading the CAMEL standard rather assumes you were involved in IN stuff and had been born at that point – alas, I was neither.
So the key to understanding CAMEL is the Basic Call State Model (BCSM) which is a model of all the different states a call can be in, such as ringing, answered, abandoned, call failed, etc, etc.
Over CAMEL, our OCS can be told by the MSC when a certain event happens: the MSC can tell the OCS that the call has changed state. For example, a BCSM event might indicate the call has hung up, is ringing, was cancelled, etc.
Below is the list of all the valid BCSM states:
List of BCSM states for events
Basic MO Call with CAMEL
Our subscriber makes an outbound call.
Based on the data the MSC has in it from the HLR, it knows that we should use CAMEL for this call, and it has the SCCP Address of the OCS (gsmSCF) it needs to send the CAMEL messages to.
So the MSC sends an InitialDP message to the OCS (via its Global Title Address) to Authorize the call that the user is trying to make.
This is like any other Authorization step for an OCS: it allows the OCS to authorize the call by checking the subscriber is valid, checking they’re allowed to call that destination, that they’ve got the balance to do so, etc.
initialDP message from an MSC to an OCS
The initialDP (Initial Detection Point) is telling our OCS all about the call event that’s being requested, who’s calling, what number they’ve dialed, where they are in the network (of note especially if they’re roaming), etc, etc.
Generally the OCS also uses this message as a chance to subscribe to BCSM Events using RequestReportBCSMEventArg, so the OCS will get notified by the MSC when the state of the call changes – events like the call getting answered, disconnected, etc. This is critical so we know when the call gets answered and hung up, so we can charge correctly.
In the below example, as well as sending the Continue and RequestReportBCSMEventArg, the OCS is also setting the ChargingArgs for this call, so the MSC knows who to charge (the caller, set via sendingSide) and that the MSC must send an Apply Charging Report (ACR) message every 300 units (1 unit = 100 ms, so a value of 300 = 300 x 100 milliseconds = 30 seconds) so the OCS keeps track of what’s going on.
continue sent by the OCS to the MSC, also including reportBCSMEvent and applyCharging messages
Or, in a slightly less appropriate analogy that’s easier for SIP folks to understand: the InitialDP is sent for the INVITE, and the 180 RINGING is sent once the continue message is received.
Call is Answered
So at this stage our call can start to ring.
As we’ve subscribed to BCSM events in our last message, the MSC is going to tell us when the call gets answered or the call times out, is abandoned or the sun burns out.
The MSC provides this info in an eventReportBCSM, which is very simple and just tells us the event that’s been triggered – in the example below, the call was answered.
eventReportBCSM from MSC to OCS
These eventReportBCSM messages are informational from the MSC to the OCS, so the OCS doesn’t need to send anything back, but the OCS does need to mark the call as answered so it can start timing the call.
At this stage, the call is connected and our two parties are talking, but our MSC has been told it needs to send us applyChargingReports every 30 seconds (due to the value of 300 in maxCallPeriodDuration) after the call was connected, so the MSC sends the OCS its first applyChargingReport 30 seconds after the call was answered:
applyChargingReport sent by the MSC to the OCS every reporting period
We can calculate the duration of the call so far based on the time of the eventReportBCSM, and then the OCS must make a decision as to whether it should allow the call to continue or not.
For simplicity’s sake, let’s imagine we’ve still got a balance in the OCS and the OCS wants the call to continue: the OCS sends back an applyCharging message to the MSC in response, and includes the current allowed maxCallPeriodDuration – keeping in mind the value is in units of 100 milliseconds (so 300 = 30 seconds).
applyCharging from the OCS back to the MSC
Perfect, our call is good to go for another 30 seconds, so in 30 seconds we’ll get another ACR message from the MSC to the OCS to keep it abreast of what’s going on.
Now one of two things is going to happen: either the subscriber is going to burn through all of their minutes and get their call cut off, or the call will end while they’ve still got balance. Let’s look at both scenarios.
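Before we walk through them, here’s a hypothetical sketch of the decision the OCS makes on each applyChargingReport – the function names, fields and tariff are illustrative, not from any particular OCS:

UNIT_MS = 100            # CAMEL call period duration units are 100 ms each
GRANT_UNITS = 300        # 300 units x 100 ms = 30 second grants
RATE_PER_SECOND = 0.002  # assumed tariff

def apply_charging(max_call_period_duration):
    return ("applyCharging", max_call_period_duration)

def release_call():
    return ("releaseCall",)

def on_apply_charging_report(balance, reported_units, leg_active):
    # Bill what the MSC just reported
    balance -= (reported_units * UNIT_MS / 1000) * RATE_PER_SECOND
    if not leg_active:
        return balance, release_call()    # call already ended, nothing to grant
    # Can the subscriber afford another full 30 second slice?
    slice_cost = (GRANT_UNITS * UNIT_MS / 1000) * RATE_PER_SECOND
    if balance >= slice_cost:
        return balance, apply_charging(GRANT_UNITS)
    return balance, release_call()        # out of balance: cut the call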
Normal Hangup Scenario
When the call ends, we get an applyChargingReport from the MSC to the OCS.
As we’ve subscribed to reportBCSMEvent, we get both the applyChargingReport with legActive: False (so we know the call has hung up) and an event report to tell us more about the event – in this case a hangup from the Originating side.
reportBCSMEvent and applyChargingReport Sent by the MSC to the OCS to indicate the call has ended, note the legActive flag is now false
Lastly the OCS confirms by sending a releaseCall to the MSC, to indicate all legs should now terminate.
releaseCall Sent by OCS to MSC at the very end
So that’s it!
Obviously there are other flows, such as running out of balance mid-call, rejecting a call, SMS and PBX / VPN services that rely on CAMEL, but hopefully you now understand the basics of how CAMEL based charging looks and works.
If you’re looking for a CAMEL capable OCS or a CAMEL to Diameter or API gateway, get in touch!
Ask anyone in the industry and they’ll tell you that GTPv2-C (aka GTP-C) uses port 2123, and they’re right, kinda.
Per TS 129.274 the Create Session Request should be sent to port 2123, but the source port can be any port:
The UDP Source Port for a GTPv2 Initial message is a locally allocated port number at the sending GTP entity.
So this means that while the Destination Port is 2123, the source port is not always 2123.
So what about a response to this? Our Create Session Response must go where?
Create Session Request coming from 166.x.y.z, from a random port 36225, going to the PGW on 172.x.y.z port 2123
The response goes to the same port the request came on, so for the example above, as the source port was 36225, the Create Session Response must be sent to port 36225.
Because:
The UDP Destination Port value of a GTPv2 Triggered message and for a Triggered Reply message shall be the value of the UDP Source Port of the corresponding message to which this GTPv2 entity is replying, except in the case of the SGSN pool scenario.
But that’s where the association ends.
So if our PGW wants to send a Create Bearer Request to the SGW, that’s an initial message, so must go to port 2123, even if the Create Session Request came from a random different port.
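Here’s a toy illustration of those rules with plain UDP sockets – no real GTP encoding, just the port behaviour:

import socket

GTPC_PORT = 2123

# "PGW" side: bound on 2123, waiting for initial messages
pgw = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
pgw.bind(("127.0.0.1", GTPC_PORT))

# "SGW" side: the initial message goes *to* 2123 from an ephemeral source port
sgw = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sgw.sendto(b"CreateSessionRequest", ("127.0.0.1", GTPC_PORT))

# The triggered response goes back to whatever source port it came from
data, (peer_ip, peer_port) = pgw.recvfrom(2048)
pgw.sendto(b"CreateSessionResponse", (peer_ip, peer_port))  # e.g. port 36225

# But a new *initial* message from the PGW (e.g. Create Bearer Request)
# must again target 2123, not the ephemeral port above
pgw.sendto(b"CreateBearerRequest", (peer_ip, GTPC_PORT))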
Ah, another entry in my “how to make software work that was made with Java in the 1990s” series – except Calix last updated this software in 2022, make of that what you will…
This time it’s the Calix Management System (CMS), the Java app for managing equipment in exchanges / COs from Calix.
What is included in the Charging Rule on Gx ultimately turns into a Create Bearer Request on GTPv2-C.
But the mapping isn’t always obvious – today I got stuck on the difference between a Single remote port type and a Single local port type, thinking that the Packet Filter Direction in the TFT controlled this. It doesn’t – it’s controlled by the order of your Traffic Flow Template rule.
Input TFT:
"permit out 17 from any 50000 to any"
Leads to Packet filter component type identifier: Single remote port type (80)
Whereas a TFT of:
"permit out 17 from any to any 50000"
Leads to Packet filter component type identifier: Single local port type (64)
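A little sketch of that mapping – the port’s position relative to “from” / “to” decides the component type, matching the behaviour above (this is a naive parser for exactly these rule shapes, nothing more):

def classify_port_component(tft_rule):
    # e.g. "permit out 17 from any 50000 to any"
    tokens = tft_rule.split()
    from_part = tokens[tokens.index("from") + 1: tokens.index("to")]
    to_part = tokens[tokens.index("to") + 1:]
    if any(t.isdigit() for t in from_part[1:]):
        return "Single remote port type"   # port sits on the "from" side
    if any(t.isdigit() for t in to_part[1:]):
        return "Single local port type"    # port sits on the "to" side
    return "no port component"

print(classify_port_component("permit out 17 from any 50000 to any"))  # remote
print(classify_port_component("permit out 17 from any to any 50000"))  # local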
As summer reaches full swing in Australia and the level of effort I put into blog posts wanes, here’s a list of books I’ve read this year or have on my to-read pile.
I can’t imagine a telecom book club being super popular, but if you’ve got any recommendations for good telecom related reads, I’d love to hear them!
The End of Telecoms History – William Webb (Read)
I read this one this year. Webb is one of those folks whose paycheck doesn’t come from shilling hardware, and he’s been pretty good at making accurate predictions and soothsaying, even when what he says upsets some.
The launch of 5G pretty much played out exactly how one of his other books (The 5G Myth) predicted, and the premise of The End of Telecoms History is: if we look at the data, which suggests that bandwidth growth will not continue unabated forever, what does that mean?
I’ve a feeling there are telecom execs quietly reading this book (while making sure that no one sees them reading it) and planning for a potential future in a world of enough bandwidth to satisfy demand, and how this would impact their bottom lines and overall business model, even if outwardly everyone still claims the growth will continue forever.
The Iron Wire: A novel of the Adelaide to Darwin telegraph line – Garry Kilworth (Read)
A fun imagined romp about adventures in the bush while connecting a nation in the 19th century. The story is inspired by real world events but is fictional – it’s a fun way to explore the topic and add bushrangers into the mix.
Rogers v. Rogers: The Battle for Control of Canada’s Telecom Empire – Alexandra Posadzki (Read)
Just finished this; I’ve worked with a lot of operators in the past, both big and small (the best ones are small), and it’s fascinating to understand at a board level how things get done in telecom giants, even if the Rogers family aren’t the best example of how to do this…
Chip War: The Fight for the World’s Most Critical Technology – Chris Miller (Read)
Without integrated circuits the telecom industry is back to relays and electromagnetically switching traffic (not that I’m against this).
Miller’s book outlines how we got to our current situation, and how the products coming out of TSMC and SMIC will shape the future of tech at a fundamental level.
How the World Was One – Arthur C Clarke (To Read)
Famed science fiction writer Arthur C Clarke had a penchant for scuba diving and communications (can relate) hence his interest in submarine telephony.
I read “Voice Across the Sea” a few years ago (in an actual paper-based book no less!) but this is freely available as an eBook and I’m looking forward to reading it.
Introducing Elixir – Simon St. Laurent & J. David Eisenberg (Reading)
The dev team at Omnitouch are all about Elixir, and being an old dinosaur I figured I should at least learn the basics!
I’m still working my way through the book, keeping a folder of examples typed out from the book (I can’t learn through copy / paste!), and enjoying it so far, even if I’m slower than I’d like.
Adventures in Innovation: Inside the Rise and Fall of Nortel – John Tyson
My first job was with Nortel, so I’ve got a bit of a soft spot for the former Canadian telecom behemoth, and never felt I’d had a satisfactory explanation as to where it all went wrong. I got this book expecting a bit more insight into the fall part, but it gave an interesting account of the design of things I’d never put much thought into before.
The Real Internet Architecture: Past, Present, and Future Evolution – Pamela Zave & Jennifer Rexford (To Read)
This came from a recommendation on Twitter, I know almost nothing about it other than that, but I’m keen to dig into this.
Burn Book: A Tech Love Story – Kara Swisher (Read)
A fun insight into the life and times of the big tech.
The 6G Manifesto – William Webb (To Read)
There’s a Simpsons’ scene where Lisa is buying an Al Gore book named “Sane Planning, Sensible Tomorrow” and says “I hope it’s as exciting as his other book, ‘Rational Thinking, Reasonable Future'”.
I can’t help but feel Webb’s books are kinda like this (in a good way).
Realism is so important; staying grounded in reality is critical. Operators who go chasing fairy tales of driving higher ARPUs with wacky ideas that have no business case or demand from end customers (and are generally pushed by vendors, rather than operators) will struggle to remain viable if they pour all their cash into things that won’t see a return. So I’m looking forward to reading some sane ideas as to how to approach the unnecessary Gs.
Stumbled across these the other day, while messing around with some values on our SMSc.
Setting the Data Coding Scheme to 16 with GSM7 encoding flags the SMS as “Flash message”, which means it pops up on the screen of the phone on top of whatever the user is doing.
While reading a quality telecom blog – bam! There’s the flash SMS popping up over whatever I was reading.
Oddly while there’s plenty of info online about Flash SMS, it does not appear in the 3GPP specifications for SMS.
Turns out they still work, move over RCS and A2P, it’s all about Flash messages!
There’s no real secret to this other than to set the Data Coding Scheme to 16, which is GSM7 with Flash class set. That’s it.
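If you want to see where the 16 comes from, here’s the DCS octet built up bit by bit per TS 23.038 (general data coding group: bit 4 = “class present”, bits 3-2 = alphabet, bits 1-0 = message class) – a sketch, not a library:

GSM7 = 0b00           # alphabet bits 3-2
UCS2 = 0b10
CLASS_0_FLASH = 0b00  # class 0 = immediate display ("flash")

def dcs(alphabet, msg_class=None):
    value = alphabet << 2
    if msg_class is not None:
        value |= 0b00010000 | msg_class  # set the "class present" bit + class
    return value

print(dcs(GSM7))                 # 0  - plain GSM7, no class
print(dcs(GSM7, CLASS_0_FLASH))  # 16 - GSM7 flash message
print(dcs(UCS2, CLASS_0_FLASH))  # 24 - UCS-2 flash message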
Obviously to take advantage of this you’d need to be a network operator, or have access to the network you wish to deliver to. Recently more A2P providers have been filtering non-vanilla SMS traffic to weed out stuff like SMS OTA messages or SIM specific messages, so there’s a good chance this may not work through A2P providers.
One of the guys at work asked a seemingly simple question: is the PLMN with MCC 505 and MNC 57 the same as MCC 505 MNC 057? The PLMN ID is encoded in the same number of octets either way, after all.
So is Mobile Network Code 57 the same as Mobile Network Code 057 in the PLMN code?
The answer is no, and it’s a massive pain in the butt.
All countries use 3 digit Mobile Country Codes, so Australia is 505. That part is easy.
The tricky part is that some countries (Like Australia) use 2 digit Mobile Network Codes, while others (Like the US) use 3 digit mobile network codes.
Why would you do this? Why would a regulator opt to have 1/10th the addressable space of network codes? I don’t know, and I haven’t been able to find an answer – if you know, please drop a comment, I’d love to know.
That’s all well and good from a SIM perspective, but less useful for scenarios where you might be the Visited PLMN for example, and only see the IMSI of a Subscriber.
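Here’s the ambiguity in a nutshell – given only an IMSI (a hypothetical one below), both readings are structurally valid, and only out-of-band knowledge resolves it:

imsi = "505570123456789"  # hypothetical IMSI

mcc = imsi[0:3]                  # "505" - MCC is always 3 digits
mnc_if_2_digit = imsi[3:5]       # "57"  - if this country uses 2 digit MNCs
mnc_if_3_digit = imsi[3:6]       # "570" - if it uses 3 digit MNCs

print(mcc, mnc_if_2_digit, mnc_if_3_digit)
# Without a lookup table of known PLMNs and their MNC lengths, a visited
# network can't tell MNC 57 from MNC 570 just by looking at the IMSI.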
We worked on a project in a country that mixed both 2 digit and 3 digit Mobile Network Codes, under the same Mobile Country Code. Certain Qualcomm phones would do very very strange things, and it took us a long time and a lot of SIM OTA to resolve the issue, but that’s a story for another day…
We’ve got a web based front end in our CRM which triggers Ansible Playbooks for provisioning of customer services. This works really well, except I’d been facing a totally baffling issue of late.
Ansible Plays (provisioning jobs) would have variables set that they’d inherited from other Ansible Plays – essentially, if I set a variable in Play X and then ran Play Y, the variable would still be set.
I thought this was an issue with our database caching the play data and showing the results from a previous play – that wasn’t the case.
Then I thought our API that triggers this might be passing in extra variables that it had cached – also not the case.
In the end I ran the Ansible function call in its simplest possible form, with no API, no database, nothing but plain vanilla Ansible called from Python:
import ansible_runner

# Run the actual playbook (playbook_path, extra_vars, inventory_path and
# event_handler_obj are all defined earlier in the real code)
r = ansible_runner.run(
    private_data_dir='/tmp/',
    playbook=playbook_path,
    extravars=extra_vars,
    inventory=inventory_path,
    event_handler=event_handler_obj.event_handler,
    quiet=False
)
And still I had the same issue of variables being inherited.
So what was the issue? Well, the title gives it away: the private_data_dir parameter creates a folder in that directory, called env/extravars, in which a JSON file lives with all the vars from all the provisioning runs.
Removing the parameter from the Python function call resolved my issue, but did not give me back the half a day I spent debugging this…
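For reference, the call that behaves – identical, just without private_data_dir, so ansible_runner creates a throwaway directory per run and no stale env/extravars file gets picked up:

r = ansible_runner.run(
    playbook=playbook_path,
    extravars=extra_vars,
    inventory=inventory_path,
    event_handler=event_handler_obj.event_handler,
    quiet=False
)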
The Data Coding Scheme (DCS or TP-DCS) header in an SMS body indicates what encoding is used in that message.
It means that if we’re using UCS-2 (UTF-16) special characters like Emojis, etc., in our message, the phone knows to decode the data in the message body using UTF-16, because the Data Coding Scheme (DCS) header indicates the contents are encoded that way.
Likewise, if we’re not using any fancy characters in our message and the message is encoded as plain old GSM7, we set the DCS to 0 to indicate this is using GSM7.
From my experience, I’d always assumed that DCS0 (Default) == GSM7, but today I learned, that’s not always the case. Some SMSc entities treat DCS0 as Latin.
Let me explain why this is stupid and why I wasted a lot of time on this.
We can indicate that a message is encoded as Latin by setting the DCS to 0x03:
We cannot indicate that the message is encoded as GSM7 through anything other than the default alphabet (DCS 0).
Latin has its own encoding flag – if I wanted the message treated as Latin, I’d indicate the message encoding is Latin in the DCS bit!
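For reference, the relevant SMPP data_coding values (these are from the SMPP 3.4 spec, which is where the DCS0-might-mean-anything problem comes from):

DATA_CODING = {
    0x00: "SMSC default alphabet - *usually* GSM7, but evidently not always",
    0x01: "IA5 / ASCII",
    0x03: "Latin-1 (ISO-8859-1)",
    0x08: "UCS2 (UTF-16BE)",
}
# If you mean Latin-1, say Latin-1:
print(DATA_CODING[0x03])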
I spent a bunch of time trying to work out why a customer was having issues getting messages to subscribers on another operator, and it turned out the other operator treats messages we send to them on SMPP with DCS0 as Latin encoding, and then cracks the sads when trying to deliver them.
The above diff shows the message we send (right), and the message they try to deliver (left).
I started seeing this error the other day when running CDRsv1.GetCDRs on the CGrateS API:
SERVER_ERROR: unexpected end of JSON input
It seemed related to certain CDRs in the cdrs table of the StoreDB.
After some digging, I found the stupid simple problem:
I’d written too much data to extra_fields, leading MySQL to cut off the data midway through, meaning it couldn’t be reconstructed as JSON by CGrateS again.
Like the rounding issue I had, this wasn’t an issue with CGrateS but with MySQL.
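If you want to check whether you’re affected, something like this should flag the rows whose JSON got cut off (assumes MySQL ≥ 5.7 for JSON_VALID, and the stock CGrateS schema with its cgrid column):

sudo mysql cgrates -e "SELECT cgrid, LENGTH(extra_fields) FROM cdrs WHERE NOT JSON_VALID(extra_fields);"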
Quick fix:
sudo mysql cgrates -e "ALTER TABLE cdrs MODIFY extra_fields LONGTEXT;"
And new entries can exceed the old column length without being cut off.
Like a lot of companies, we’re moving away from VMware, and in our case, shifting to Proxmox.
But that doesn’t mean we can get entirely away from VMware – it’s more that it’s not our hypervisor of choice anymore, and this means shifting our dev environments and lab off VMware to Proxmox first.
So today I sat down to try and shift everything to Proxmox, while keeping the VMware based VMs accessible until they can slowly die of bitrot.
A sane person would probably utilize Proxmox’s fancy new tool for migrating VMs from VMware to Proxmox – and it’s great – but in our case at least, it required logging into each VM and remapping NICs, etc., which is tricky on boxes I don’t have access to. Plus we need to keep some VMware capability for testing / labbing stuff up.
So I decided to install Proxmox onto the bare metal servers, and then create a virtual machine inside the Proxmox stack to host a VMware ESXi instance.
I started off inside VMware (Before installing any Proxmox) by moving all the VMs onto a single physical disk, which I then removed from the server, so as to not accidentally format the one disk I didn’t want to format.
Next I nuked the server and setup the new stack with Proxmox, which is a doddle, and not something I’ll cover.
Then I loaded a VMware ISO into Proxmox and started setting up the VM.
Now, nested virtualization is a real pain in the behind.
VMware doesn’t like not being run on bare metal, and it took me a good long amount of time to find the hardware config that I could setup in Proxmox that VMware would accept.
Create the VM in the Web UI; I found using a SATA drive worked while SCSI failed, so create a SATA based LVM image to use, and mount the datastore ISO.
Then edit /etc/pve/qemu-server/your_id.conf and replace the netX, args, boot and ostype to match the below:
Now you can go and start the VM, but once you’ve got the VMware splash screen, you’ll need to press Shift + O to enter the boot options.
After the runweasle cdromBoot options, add allowLegacyCPU=true – this will allow ESXi to use our (virtual) CPU.
Next up you’ll install VMware ESXi just like you’ve probably done 100 times before (is this the last time?), and once it’s done installing, power off – we’ll have to make a few changes to the VM definition file.
Then after install we need to change the boot order, by updating:
boot: order=sata0
And unmount the ISO:
ide2: none,media=cdrom
Now remember how I’d pulled the hard disk containing all the VMware VMs out so I couldn’t break it? Well, don’t drop that, because now we’re going to map that physical drive into the VM for VMware, so I can boot all those VMs.
I plugged in the drive and used this to find the one I’d just inserted:
fdisk -l
Which showed the drive I’d just added last, with its VMware file system.
So next we need to map this through the VM we just created inside Proxmox, so VMware inside Proxmox can access the VMware file system on the disk filled with all our old VMware VMs.
VM. VM. VM. The word has lost all meaning to me at this stage.
We can see the mount point of our physical disk; in our case it’s /dev/sdc, so that’s what we’ll pass through to the VM.
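Passing it through is one command from the Proxmox host – something like this, assuming the VMware VM got ID 100 and sata1 is a free slot (using a /dev/disk/by-id/ path instead of /dev/sdc is more robust if you have the patience):

qm set 100 -sata1 /dev/sdc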
And now, if everything has gone well, after logging into the Web UI, you’ll see this:
Then the last step is going to be re-registering all the VMs. You can do this by hand, by selecting each .vmx file and adding it.
Alternately, if you’re lazy like me, I wrote a little script to do the same thing:
[root@localhost:~] cat load_vms3.sh
#!/bin/bash

# Datastore name
DATASTORE="FatBoi/"

# Log file to store the output
LOG_FILE="/var/log/register_vms.log"

# Clear the log file
> $LOG_FILE

echo "Starting VM registration process on datastore: $DATASTORE" | tee -a $LOG_FILE

# Check if datastore directory exists
if [ ! -d "/vmfs/volumes/$DATASTORE" ]; then
    echo "Datastore $DATASTORE does not exist!" | tee -a $LOG_FILE
    exit 1
fi

# Find all .vmx files in the datastore and register them
find /vmfs/volumes/$DATASTORE -type f -name "*.vmx" | while read VMX_PATH; do
    echo "Registering VM: $VMX_PATH" | tee -a $LOG_FILE
    vim-cmd solo/registervm "$VMX_PATH" | tee -a $LOG_FILE
done

echo "VM registration process completed." | tee -a $LOG_FILE

[root@localhost:~] sh load_vms3.sh
Now with all your VMs loaded, you should almost be ready to roll and power them all back on.
But before we reboot the hypervisor (Proxmox), we’ll have to reboot the VMware hypervisor too, because here’s something else to make you punch the screen:
Luckily we can fix this one globally.
SSH into the VMware box, edit the /etc/vmware/config file and add:
vhv.enable = "FALSE"
Which will disable the performance counters.
Now power off the VMware VM and reboot the Proxmox hypervisor. When it powers on again, Proxmox will allow nested virtualization, and when you power the VMware VM back on, you’ll have performance counters disabled – and then you will be done.
Yeah, not a great use of my Saturday, but here we are…
S8 Home Routing is a really simple concept, the traffic goes from the SGW in the visited PLMN to the PGW in the home PLMN, so the PCRF, OCS/OFCS, IMS, IP Addresses, etc, etc, are all in the home network, and this avoids huge amounts of complexity.
But in order for this to work, the visited network MME needs to find the PGW of the home network, and with over 700 roaming networks in commercial use, each one with potentially hundreds of unique APNs each routing to a different PGW, this is a tricky proposition.
If you’ve configured your PGW peers statically on your MME, that’s fine, but it doesn’t scale very well – and if you add an MVNO who wants their own PGW for serving their APN, well, you’ll be adding some complexity there too. So what to do?
Well, the answer is DNS.
By taking the APN to be served, the home PLMN and the interface type desired, and running some funky DNS queries, our MME can determine which PGW should be selected for a request.
Let’s take a look, for a UE from MNC XXX MCC YYY roaming into our network, trying to access the “IMS” APN.
Our MME knows the network code of the roaming subscriber from the IMSI is MNC XXX, MCC YYY, and that the UE is requesting the IMS APN.
So our MME crafts a NAPTR DNS query for ims.apn.epc.mncXXX.mccYYY.3gppnetwork.org:
Because the domain is epc.mncXXX.mccYYY.3gppnetwork.org it’s routed to the authoritative DNS server in the home network, which sends back the response:
We’ve got a few peers to pick from, so we need to filter this list of Answers to only those that are relevant to us.
First we filter by the Service tag, which for each listed peer shows what services that peer supports.
Since we’re looking for S8, we need to find a peer whose “Service” tag string contains:
x-3gpp-pgw:x-s8-gtp
We’re looking for two bits of info here, the presence of x-3gpp-pgw in the Service to indicate that this peer is a PGW and x-s8-gtp to indicate that this peer supports the S8 interface.
A service string like this:
x-3gpp-pgw:x-s5-gtp
Would be excluded, as it only supports S5, not S8 (even though they are largely the same interface, S8 is used in roaming).
It’s also not uncommon to see both services indicated as supported, in which case that peer could be selected too:
x-3gpp-pgw:x-s5-gtp:x-s8-gtp
(The answers in the screenshot include :x-gp which means the PGWs advertised are also co-located with a GGSN)
So with our answers whittled down to only those that meet our needs, we next use the Order and the Preference to pick our best candidate, this is the same as regular DNS selection logic.
From our candidate, we’ve also got the Regex Replacement, which allows our original DNS request to be re-written, pointing us at a single peer.
In our answer, we see the original request ims.apn.epc.mncXXX.mccYYY.3gppnetwork.org is to be re-written to topon.lb1.pgw01.epc.mncXXX.mccYYY.3gppnetwork.org.
This is the FQDN of the PGW we should use.
Now we know the FQDN we should use, we just do an A record lookup (or AAAA record lookup if it’s IPv6) for the peer we are targeting, to turn that FQDN into an IP address we can use.
And then in comes the response:
So now our MME knows the IP of the PGW, and it can craft a Create Session Request where the F-TEID for the S8 interface has the PGW IP we selected set on it.
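If you want to play along at home, here’s a rough sketch of the whole selection flow using dnspython – the query name is the placeholder from the walkthrough above:

import dns.resolver

qname = "ims.apn.epc.mncXXX.mccYYY.3gppnetwork.org"
answers = dns.resolver.resolve(qname, "NAPTR")

# Keep only PGWs that support S8, then sort per NAPTR Order / Preference
candidates = [r for r in answers
              if b"x-3gpp-pgw" in r.service and b"x-s8-gtp" in r.service]
candidates.sort(key=lambda r: (r.order, r.preference))

pgw_fqdn = str(candidates[0].replacement)  # e.g. topon.lb1.pgw01.epc...

# Finally, resolve the chosen PGW FQDN to an address for the S8 F-TEID
pgw_ip = dns.resolver.resolve(pgw_fqdn, "A")[0].address
print(pgw_fqdn, pgw_ip)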
For more info on this TS 129.303 (Domain Name System Procedures) is the definitive doc, but the GSMA’s IR.88 “LTE and EPC Roaming Guidelines” provides a handy reference.
Want more telecom goodness?
I have a good old fashioned RSS feed you can subscribe to.