Christoph Oberhofer

The Making of Switch - Part 4: GTFS Timetable

Once I started collecting movement data, I shifted my data gathering efforts towards the second phase: the train schedule. As already identified in Part 2 (Foundations of a location-based app), the timetable is the other remaining ingredient for reliable journey matching.

First, let’s revisit the thought experiment from an earlier chapter, where I tried to figure out which train I was on by relying solely on the departure monitors at the stations. That experiment made the underlying data needs concrete.

Remember, one of the goals was to build an experience that didn’t require internet access; the schedule therefore had to be available for offline use. That meant the entire timetable needed to live on the user’s device, in one form or another.

Searching for a timetable

Before deciding on a data format, I wanted to find a reliable source for timetable data, to make sure I wasn’t maneuvering myself into a corner. To keep the scope manageable, I narrowed the search down to the railway network in Austria.

For those of you who are not familiar with the European train schedule: timetables change only once a year, across Europe, in unison. This goes a long way toward keeping things stable and predictable. It also benefits transit data providers, since a single schedule is valid for the entire year. Of course, there are occasional updates, but 99% of the schedule typically stays the same throughout the year.

The first, and most obvious, source was the Open Data section of the Austrian Railways (ÖBB). They offer GTFS data for the current and previous timetable years. This turned out to be a good starting point, despite some drawbacks. During the first inspection of the files I realized that something was missing: there was no sign of other agencies and their trains running on Austrian infrastructure. This meant that cross-border trains operated by other agencies, such as Deutsche Bahn, were excluded from that schedule. I could have lived with that, but the absence of WESTbahn, the second largest passenger train operator in Austria, made the decision easier. I had to find another source.

After a little while I discovered the non-profit called Mobilitätsverbünde Österreich. They claimed to offer more comprehensive data than ÖBB, including most train agencies operating in Austria. A download later, I inspected the files and they passed the initial check for what I needed.

What is GTFS?

With the timetable data on disk, it’s time to explore the GTFS format and discuss how it relates to the initial problem statement. GTFS is short for General Transit Feed Specification, an open standard for passenger-facing transit information. The detailed specification is available here.

The standard provides detailed specifications for almost every aspect of public transport. Because it’s easy to get overwhelmed by the entire offering, narrowing things down to the actual use-cases is the best way to stay sane. Early on I identified the following elements as the ones unlocking Switch’s future capabilities: stops.txt, stop_times.txt, trips.txt, calendar.txt, calendar_dates.txt and shapes.txt.

ER Diagram

The ER diagram above visualizes the relationships between these elements of the specification, narrowed down to the Switch use-case. At the center of it, trips.txt ties everything together: its service days are defined by calendar.txt, stop_times.txt defines the departure and arrival times for each stop along the way, and shapes.txt describes the path the train runs on.

stops.txt

As I’ve already pointed out multiple times in the previous chapters, the departure and arrival stations are among the most critical datapoints. Knowing that, and the fact that trains only travel from one stop to another, the location and name of each station was an important reference to have. The stops.txt file contains all stops relevant to the trains running in that timetable year.

The properties relevant for Switch include: stop_id, stop_name, stop_lat, stop_lon, location_type and parent_station. To be clear, a stop doesn’t always refer to the actual station, but can be a platform instead. The parent_station attribute indicates that relationship, and ties those two together. For the purpose of Switch, I decided to only use the actual station, simply because the low accuracy of the location readings didn’t allow me to target individual platforms.

With this data in place, Switch gained the ability to query nearby stations using a reference location and radius. For example:
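In a minimal sketch, such a lookup can be a crude bounding-box query over the stops table. The helper type, the function name and the reference coordinates below are illustrative, not the actual Switch code; only the column names follow the GTFS spec:

// Minimal structural type standing in for whatever SQLite binding the app uses.
type Db = { getAllAsync<T>(sql: string, params: unknown[]): Promise<T[]> };

async function nearbyStations(db: Db, lat: number, lon: number, radiusKm: number) {
  // Approximate the radius with a bounding box; one degree of latitude is ~111 km.
  const latDelta = radiusKm / 111;
  const lonDelta = radiusKm / (111 * Math.cos((lat * Math.PI) / 180));
  return db.getAllAsync<{ stop_id: string; stop_name: string; stop_lat: number; stop_lon: number; location_type: number }>(
    `SELECT stop_id, stop_name, stop_lat, stop_lon, location_type
       FROM stops
      WHERE location_type = 1            -- stations only, not platforms
        AND stop_lat BETWEEN ? AND ?
        AND stop_lon BETWEEN ? AND ?`,
    [lat - latDelta, lat + latDelta, lon - lonDelta, lon + lonDelta]
  );
}

// e.g. stations within roughly 1 km of Wien Meidling:
// const stations = await nearbyStations(db, 48.1747, 16.3332, 1);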

Resulting in:

stop_id     | name          | latitude    | longitude   | type
Pat:49:1015 | Wien Meidling | 48.17427434 | 16.33190187 | 1

Knowing where I was got me another step closer to replicating a departure monitor, a key milestone towards full journey matching.

Timetable

The next step was to find trips that depart from a specific station. Given the ER diagram above, the relationships between stop_times.txt, trips.txt and stops.txt seemed promising.

One needs to be careful here: the stop_times.txt file references platforms, not stations, which forced me to always include all platforms of a particular station. With that hurdle behind me, I was able to list all trips stopping at a station. Since this result set covered the entire timetable year, it had to be narrowed down further, based on the current time of day and the date. Luckily, every location reading has a timestamp attached.
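Before getting to the time filters, here is a minimal sketch of that platform expansion, reusing the same kind of SQLite helper as above; the names are illustrative, not the actual Switch code:

type Db = { getAllAsync<T>(sql: string, params: unknown[]): Promise<T[]> };

// Expand a station into itself plus all of its platforms, so that lookups against
// stop_times.txt (which references platforms) don't miss anything.
async function stopIdsForStation(db: Db, stationId: string): Promise<string[]> {
  const rows = await db.getAllAsync<{ stop_id: string }>(
    `SELECT stop_id FROM stops WHERE stop_id = ? OR parent_station = ?`,
    [stationId, stationId]
  );
  return rows.map((r) => r.stop_id);
}

// e.g. stopIdsForStation(db, "Pat:49:1015") returns the station plus its platforms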

First, when dealing with the time of day, one needs to understand that trains almost never leave before their planned departure, but are rather delayed by a few minutes or more. This is why I decided to define a time window spanning -5 to +30 minutes around the current time, leaving enough wiggle room for potential delays.

Next up was the date itself, which was required to determine whether a train was running on that day at all. Looking at the ER diagram again, there are two relevant connections: calendar.txt, which defines the regular weekday pattern and validity range of a service, and calendar_dates.txt, which adds or removes service on specific days as exceptions. Only with both combined was I able to accurately determine the departures.

The following pseudo SQL code may provide further clarity:

SELECT *
FROM stop_times st
  JOIN stops      ON stops.stop_id = st.stop_id
  JOIN trips      ON trips.trip_id = st.trip_id
  JOIN calendar c ON c.service_id  = trips.service_id
  LEFT JOIN calendar_dates cd
             ON cd.service_id = trips.service_id
            AND cd.date       = '2026-02-04'
WHERE
  stops.stop_id = 'Pat:49:1015'        -- Wien Meidling
  AND c.start_date <= '2026-02-04'     -- service window covers today
  AND c.end_date   >= '2026-02-04'
  AND c.wednesday  =  1                -- Feb 4th 2026 is a Wednesday
  AND cd.date IS NULL                  -- no exception for the service today
  AND st.departure_time >= '08:35:00'  -- 08:40 - 5 min
  AND st.departure_time <= '09:10:00'  -- 08:40 + 30 min

Running this query returns all trips stopping at Wien Meidling on Feb 4th, departing between 8:35 am and 9:10 am. I went ahead and compared the results with the official departure monitor, and was pleased to see that they (almost) matched. The first screenshot shows the official departure monitor, the second one the result of the above query.

Official Timetable | GTFS Timetable

When comparing the two screenshots, one might notice subtle differences in the data. First, the official listing includes a real-time departure column (Aktuell), which is obviously missing from our data. Secondly, the destinations (Nach) for REX6 and REX65 do not match. This is due to the way train trips are structured: a single physical train may be composed of multiple consecutive trips. Luckily, the block_id property on those trips allowed me to stitch those trains back together.

Block-IDs

The first time I noticed a difference between the official timetables and what I was seeing in Switch was on a trip from Vienna to Hallstatt. Once a day, you can get there on a direct train, without the need to catch a connection. However, when looking up this trip in GTFS, I noticed that it seemingly terminated early, in Linz.

trip_id | block_id | name    | headsign | departure | stop | distance
18.TA   | 2071     | IC 1118 | Stainach | 07:45     | Wien | 0
18.TA   | 2071     | IC 1118 | Stainach | 09:00     | Linz | 181

I didn’t understand why this was happening, so I started scanning the data for other mentions of the same trip name, IC 1118. To say the least, I was surprised when another entry came up, with a different trip_id.

trip_id | block_id | name    | headsign | departure | stop      | dist
1.TA    | 2071     | IC 1118 | Stainach | 09:04     | Linz      | 0
1.TA    | 2071     | IC 1118 | Stainach | 10:56     | Hallstatt | 119

When I laid the two entries side by side, it clicked. The trip_id may be different, but the block_id is the same. Additionally, the arrival stop of the first trip matches the departure stop of the second trip, confirming my theory. Of course, this made things a little more complicated for journey detection, because I needed the entire trip to be one continuous journey.

This is when I decided to abstract the block-id concept and combine all trips within a block into a single trip, which also meant combining and offsetting the information presented in shapes.txt and stop_times.txt, such as distance and sequence. As seen above, both trips start counting at 0, which wasn’t ideal. Another problem was that trip_id is a foreign key in a bunch of other files, complicating things even further. This required carefully threading the needle and using an abstract GtfsId concept in place of a simple trip-id string.
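To make the offsetting concrete, here is a minimal sketch of folding the stop times of one block into a single continuous trip. The record shape and function name are illustrative, and a real implementation would also deduplicate the shared stop where one trip ends and the next begins:

interface StopTime {
  stopId: string;
  departure: string;     // "HH:MM:SS"
  sequence: number;
  distTraveled: number;  // meters from the start of its own trip
}

// Merge the trips of one block (ordered by departure) into one continuous trip,
// shifting sequence numbers and distances so they keep counting across boundaries.
function mergeBlock(trips: StopTime[][]): StopTime[] {
  const merged: StopTime[] = [];
  let seqOffset = 0;
  let distOffset = 0;
  for (const stopTimes of trips) {
    for (const st of stopTimes) {
      merged.push({
        ...st,
        sequence: st.sequence + seqOffset,
        distTraveled: st.distTraveled + distOffset,
      });
    }
    const last = stopTimes[stopTimes.length - 1];
    seqOffset += stopTimes.length;
    distOffset += last.distTraveled;   // the next trip continues where this one ended
  }
  return merged;
}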

From that moment on, a trip-id was no longer a simple string, but an instance of either a GtfsSingleTripId or GtfsBlockTripId. When applied to the trip discussed above, the trip-id would result in a GtfsBlockTripId matching 2071|18.TA|1.TA.
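The class names below are the ones mentioned above; their internals are a simplified guess at how such an abstraction can look, not the actual Switch implementation:

class GtfsSingleTripId {
  constructor(readonly tripId: string) {}
  toString(): string {
    return this.tripId;
  }
}

class GtfsBlockTripId {
  // trip ids are kept in departure order, e.g. ["18.TA", "1.TA"] for block 2071
  constructor(readonly blockId: string, readonly tripIds: string[]) {}
  toString(): string {
    return `${this.blockId}|${this.tripIds.join("|")}`;
  }
}

// A trip identifier is either a single GTFS trip or a whole block of chained trips.
type GtfsId = GtfsSingleTripId | GtfsBlockTripId;

const ic1118: GtfsId = new GtfsBlockTripId("2071", ["18.TA", "1.TA"]);
console.log(ic1118.toString()); // "2071|18.TA|1.TA"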

Rail Network

People who are not familiar with the geography of Austria and its rail network may still have noticed the wide variety of destinations displayed on the departure monitor: Bregenz lies all the way to the west, while Bratislava is east of Vienna. So how do we know which train we are on? I started out by combining stop_times.txt with the respective shapes.txt.

In terms of file size, shapes.txt is by far the largest among its siblings, containing over a million entries, and that just for the Austrian schedule. Each entry represents a point on the path the train travels along. In addition to the geo-information, a dist_traveled property indicates how far along the trip that point lies.

To give you a better idea of what the contents of the file look like, here’s an excerpt matching the train mentioned above (Vienna - Hallstatt):

id    | seq | dist_traveled | latitude    | longitude
1.4.H | 1   | 0             | 48.19670069 | 16.33655568
1.4.H | 2   | 168           | 48.19621052 | 16.33440440
1.4.H | 3   | 193           | 48.19613411 | 16.33409313
1.4.H | 4   | 219           | 48.19604381 | 16.33376336

Looking at the dist_traveled column, we can determine, for a given pair of coordinates, how far along a trip we are (in meters). This is a very helpful datapoint when calculating the likelihood of being on a specific train. For example, you can rule out a trip if the distance traveled decreases over time, which would mean the train is going backwards.
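As an illustration, here is a rough sketch of snapping a location reading to the nearest shape point and reading off its dist_traveled. The names are mine, not Switch’s, and a real implementation would interpolate between points rather than snapping:

interface ShapePoint {
  lat: number;
  lon: number;
  distTraveled: number; // meters from the start of the trip
}

// Equirectangular approximation -- good enough for picking the closest point.
function distanceMeters(aLat: number, aLon: number, bLat: number, bLon: number): number {
  const R = 6371000; // mean earth radius in meters
  const x = ((bLon - aLon) * Math.PI / 180) * Math.cos(((aLat + bLat) / 2) * Math.PI / 180);
  const y = ((bLat - aLat) * Math.PI) / 180;
  return Math.sqrt(x * x + y * y) * R;
}

// How far along the shape (in meters) is the given location?
function positionAlongShape(shape: ShapePoint[], lat: number, lon: number): number {
  let best = shape[0];
  let bestDist = Infinity;
  for (const p of shape) {
    const d = distanceMeters(lat, lon, p.lat, p.lon);
    if (d < bestDist) {
      bestDist = d;
      best = p;
    }
  }
  return best.distTraveled;
}

// A trip whose projected position keeps shrinking across consecutive readings
// would correspond to a train driving backwards and can be ruled out.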

To further illustrate the usefulness of shapes, here are two screenshots showing the possible trips from Vienna (Meidling) and Graz, given a 35 minute time window.

Departures Vienna | Departures Graz

As seen in the example of Vienna, quite a lot of trains pass through a single station in just 35 minutes, and that’s not even the main railway station. The more trains, the more ambiguity and complexity arise during journey matching.

Data import

Once the data was on my developer machine, and its structure well understood, it was time to get it loaded into the app. As already discussed in Part 2: A location-based app, I decided to store all scheduling information in a SQLite database. Given the relational nature of the source data, this made the most sense, and querying scheduling information became a straightforward SQL exercise. While I played with the idea of creating an entirely different data model in SQL, I decided to stick very closely to GTFS, since the working group behind the specification has already done the heavy lifting.

The only thing I truly worried about was the size of the shapes.txt file, which, with over 1 million entries, posed quite a challenge. I was only able to solve this with reasonable performance after the following steps:

First, I tested the most naive approach, using the expo-sqlite module. This was unbearably slow, taking tens of minutes to import all shape entries. After careful reconsideration, the next step was to reduce the density of the shapes data. Instead of using every single entry, I only kept entries that were at least 1 km apart from each other. This resulted in roughly a tenfold reduction in data, without losing too much accuracy. But even that improvement only shaved off a couple of minutes, still taking too long to process.
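A minimal sketch of that density reduction, using dist_traveled as the spacing measure (one of several ways to do it; the names are illustrative):

// Keep a shape point only if it is at least `minGapMeters` further along the track
// than the previously kept point; the final point is always preserved.
function thinShape<T extends { distTraveled: number }>(points: T[], minGapMeters = 1000): T[] {
  if (points.length === 0) return [];
  const kept: T[] = [points[0]];
  for (const p of points) {
    if (p.distTraveled - kept[kept.length - 1].distTraveled >= minGapMeters) {
      kept.push(p);
    }
  }
  if (kept[kept.length - 1] !== points[points.length - 1]) {
    kept.push(points[points.length - 1]); // keep the destination point
  }
  return kept;
}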

The next step was using batch inserts, combining 1,000 inserts into a single statement. This change had some positive impact on speed, but in turn sacrificed the stability of the app, causing it to crash from time to time.
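For illustration, a batched insert along these lines folds many rows into one multi-row INSERT statement. The runAsync helper and the column names are stand-ins, not the actual Switch code; note that SQLite caps the number of bound parameters per statement, which limits how large a single batch can get:

const BATCH_SIZE = 1000;

type Db = { runAsync(sql: string, params: unknown[]): Promise<unknown> };

// Insert shape rows in batches, each batch as a single multi-row INSERT statement.
async function insertShapes(db: Db, rows: [string, number, number, number, number][]) {
  for (let i = 0; i < rows.length; i += BATCH_SIZE) {
    const batch = rows.slice(i, i + BATCH_SIZE);
    const placeholders = batch.map(() => "(?, ?, ?, ?, ?)").join(", ");
    await db.runAsync(
      `INSERT INTO shapes (shape_id, sequence, dist_traveled, lat, lon) VALUES ${placeholders}`,
      batch.flat()
    );
  }
}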

This was the moment I decided to switch from TypeScript to platform-native code, moving the parser and importer to Kotlin. Even though batched inserts weren’t supported by Android, the time to import the shapes was reduced to just 10 seconds, which was a reasonable compromise.

Geospatial data

The data, specifically the stops and shapes, would have benefited from a geospatial database, but unfortunately the SQLite binary shipped with Android does not come with R-Tree support enabled (see the missing flag in Android.bp). This forced me to handle geo-data a little differently.

In order to find train stations within a radius of a location, I decided to load all stations into memory and index them with geokdbush, a k-d tree based geographic index, for quick lookups. This approach kept load and access times to a minimum.
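A sketch of that index, assuming the kdbush v3-style constructor and geokdbush’s around() helper; both libraries are the ones named above, but the exact signatures depend on the versions in use, and the station type is illustrative:

import KDBush from "kdbush";
import * as geokdbush from "geokdbush";

interface Station {
  stopId: string;
  name: string;
  lat: number;
  lon: number;
}

// Build the static index once, after loading all stations from stops.txt.
function buildStationIndex(stations: Station[]) {
  return new KDBush(stations, (s: Station) => s.lon, (s: Station) => s.lat);
}

// All stations within `radiusKm` of a location, closest first.
function stationsAround(index: ReturnType<typeof buildStationIndex>, lon: number, lat: number, radiusKm: number): Station[] {
  return geokdbush.around(index, lon, lat, Infinity, radiusKm);
}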

Annual Timetable Change

It comes as no surprise that timetables change over time, and a product relying on them needs to adapt as well. When I started working on Switch back in July, I wasn’t really thinking about updating the schedule and pushed it off as a future problem. With December creeping closer, I was forced to find a solution. Why December? As mentioned above, every year in December the rail timetable is updated in most of Europe.

My initial implementation wasn’t keeping track of the version of a timetable, making it impossible to simply import a new schedule, since previous versions needed to stick around. The easiest approach was to create a separate SQLite database file for each timetable year, and that’s exactly what I did. Other solutions, such as version columns, seemed to overcomplicate things, especially when it came to migrations and testing.
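A sketch of how such a split can look in practice; the file naming scheme and the changeover handling here are assumptions for illustration, not the actual Switch code:

// One SQLite file per timetable year. The European timetable year changes in
// mid-December, so the exact changeover date has to be supplied per year.
function timetableDatabaseFile(date: Date, changeover: Date): string {
  // after the December changeover, the data already belongs to next year's timetable
  const year = date >= changeover ? date.getFullYear() + 1 : date.getFullYear();
  return `timetable-${year}.db`;
}

// e.g. with a hypothetical changeover of Dec 14th 2025:
// timetableDatabaseFile(new Date("2025-12-20"), new Date("2025-12-14")) // "timetable-2026.db"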

However, the current approach has its limits too and consumes a lot of storage on the device, with each timetable weighing around 200 MB. One way to optimize storage for past timetables would be to only keep the journeys the user has actually taken and discard everything else. Since this is an annual problem, I still have until December 2026 to find a more elegant solution.

Data Quality

One aspect I noticed, especially with the timetable change, is that the quality of the GTFS files varies. First off, I have to say that we are pretty lucky here in Austria to have access to this sort of data for free. Looking across the border to Germany, I haven’t found a comparably complete timetable: none of the available datasets shipped with shapes (tracks), which made them useless for tracking purposes.

However, I’ve found several issues in the Austrian GTFS dataset, like platforms that have no parent station, or trips that weren’t correctly linked to their block.

The following table lists all stops (the station and its platforms) related to the Semmering location, according to the 2025 timetable.

stop_id        | name      | type | parent_station
at:43:8769:0:2 | Semmering | NULL | Pat:43:8769
at:43:8769:0:3 | Semmering | NULL | Pat:43:8769
Pat:43:8769    | Semmering | 1    | NULL

When I imported the data for the 2026 timetable year, I was surprised to see that the parent location had disappeared, leaving the two platforms disconnected. See below:

stop_id        | name      | type | parent_station
at:43:8769:0:2 | Semmering | NULL | NULL
at:43:8769:0:3 | Semmering | NULL | NULL

The lack of parent stations is quite unfortunate, because parent_station is the only reliable way to connect platforms to their station. The shared stop_id prefix could arguably serve the same purpose, but relying on it seems more brittle than useful. Although I haven’t reported any of these findings to Mobilitätsverbünde Österreich yet, I may do so in the future.

Cross Border Timetable

Another shortcoming of the GTFS data availability is the lack of international coverage, which severely limits support for cross-border traffic. I treated the first few versions of Switch as a feasibility study, so initially I was only concerned with the Austrian schedule. But as soon as I started tracking train journeys in neighboring countries, such as Germany or Slovenia, I got more interested in solving this problem too. However, due to the lack of time and the aforementioned data availability, I tabled this issue for later.

This chapter concludes the data-collection efforts; I’m finally ready to move on to Part 5: Journey Detection.