Christoph Oberhofer

The Making of Switch - Part 5: Journey Detection

Finally, the moment we’ve all been waiting for has arrived: the actual journey detection mechanism powering Switch, turning location, activity, and timetable data into a journey.

Logic Event View

Throughout the past chapters, I started collecting location and activity data, loaded the Austrian train schedule, and experienced first-hand just how unreliable real-world data actually is. I have spent most of my career building deterministic systems with strict specifications and input/output guarantees, but this felt different. Internalizing this wasn’t easy for me; I had to change the way I think about data: not just zeros and ones anymore, but many shades of floating point in between.

Humble Beginnings

During my initial research on how to build the journey detection system, I tried to replicate what I had seen in Strava, the sports-tracking app, where known segments are matched against your own recordings. However, due to the big difference between the exact GPS tracks in Strava and the inaccurate location readings in Switch, I discarded this approach early on. The other reason I abandoned this idea was the lack of a geospatial data store for our path data, which would have been required for matching geographic shapes. At that point I realized I was getting ahead of myself and needed to slow down to better understand the problem space I was in.

Components of a train journey

Instead of treating this as a purely geographical problem, I decided to start by breaking apart and defining what a train journey actually is. This was also a good moment for me to further refine the terminology and be more consistent. There are just so many ways to refer to a train journey, which in itself carries multiple meanings. Back in Part 1: Motivation I did my best to define the terminology for Switch, so use that as a reference when in doubt.

Sequence of a regular train journey

After having traveled quite a lot on trains recently, I tried to find similar patterns during my trips, documenting them as a sequence of events:

Journey Sequence

Of course, this was an oversimplified view of a journey, but it turned out to be good enough to get started with. When I zoomed out a little further and looked at what data I had available (movement & schedule data) and what I wanted to derive from it (train journey), I had a lightbulb moment:

This was the sequence of events I wanted to find, given the movement data (observations) I had.

Hidden Markov Model (HMM)

This was the moment I first learned about the power of Hidden Markov Models (HMMs) and their applications. The state (the actual journey) is hidden, but the observations and the transitions from one state to another are known. For example, I cannot magically change trains without exiting at the next stop and then re-boarding a different train. All of this can be modeled with an HMM, and later decoded with the Viterbi algorithm to find the actual path.

The way I wanted the journey detection to work was to find the most likely path incrementally, step by step, instead of waiting for the trip to complete. This approach also allowed me to implement features relevant during a journey, like showing which train I was most likely on, and how far that trip had already progressed. Designing the journey matcher to work step by step also meant that pre-processing was limited and looking into the future wasn’t possible, as opposed to a batched approach.

The goal was clear: I had to find a way to map each observation to one or more hidden states and calculate their probabilities. Imagine building a tree of hidden states that grows larger with every observation. Each hidden state can spawn multiple paths again, which eventually end at a defined exit criterion. A naive approach will inevitably grow the tree very large, which required me to put pruning mechanisms in place to keep the working set well balanced.
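One simple way to prune (a generic sketch, not necessarily what Switch actually does) is a beam cut: after each iteration, keep only the K most probable states and drop the rest.

// Generic beam-pruning sketch: keep only the K most probable states.
// `scores` maps state id -> log probability; BEAM_WIDTH is an assumed tuning constant.
const BEAM_WIDTH = 50;

function prune(scores: Map<string, number>): Map<string, number> {
  const top = [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest log probability first
    .slice(0, BEAM_WIDTH);
  return new Map(top);
}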

Fundamentally, the way HMMs work, in combination with Viterbi, is pretty straightforward. First, you define your hidden states and their transition and emission probabilities. Second, for each observation, you calculate the probability of one state transitioning to another, given the transition matrix. Third, you calculate the emission probability, given the current observation and hidden state. Once that’s done, you link the current state to the most likely previous state, creating a chain of probabilities. Traversing this chain gives you the most likely sequence of states, revealing the hidden path.
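To make the recurrence concrete, here is a toy example with two made-up states and made-up numbers (nothing to do with Switch’s actual model): for each new state, take the best previous score times the transition probability, then multiply by the emission probability.

// Toy Viterbi step with two hypothetical states "A" and "B" (made-up numbers)
const prevScores: Record<string, number> = { A: 0.6, B: 0.4 };
const transition: Record<string, Record<string, number>> = {
  A: { A: 0.7, B: 0.3 },
  B: { A: 0.2, B: 0.8 },
};
// Emission probability of the current observation for each state
const emission: Record<string, number> = { A: 0.1, B: 0.9 };

// For each new state, pick the most likely previous state and multiply
// previous score * transition probability * emission probability.
const nextScores: Record<string, number> = {};
for (const to of ["A", "B"]) {
  let best = 0;
  for (const from of ["A", "B"]) {
    best = Math.max(best, prevScores[from] * transition[from][to]);
  }
  nextScores[to] = best * emission[to];
}
// nextScores.A = max(0.6*0.7, 0.4*0.2) * 0.1 = 0.042
// nextScores.B = max(0.6*0.3, 0.4*0.8) * 0.9 = 0.288  -> "B" is now the more likely state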

Observations

In the case of Switch, observations are just an enriched version of the movement data. To better align with the streaming data pattern, I decided to transform each location and activity record into individual events. This had the additional advantage of being able to reduce the events into a state, enhancing the journey matcher with additional context. I defined the following types of events:

event | description
LOCATION_UPDATED | This event indicates a new location reading, including timestamp and coordinates
ACTIVITY_TRANSITIONED | This event is triggered every time an activity transition (enter / exit) is detected
GEOFENCE_TRIGGERED | This event is derived from location and GTFS data. It indicates whether a specific train station was entered or exited.

Each observation pushes the state machine forward, helping us to reveal the hidden path.
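As a rough illustration, these events could be modeled as a discriminated union along the following lines; the field names are my assumption and not necessarily the actual Switch types, but the later snippets rely on at least a type tag, a timestamp and a location.

// Hypothetical shape of the observation events (field names are assumptions)
type LocationUpdated = {
  type: "LOCATION_UPDATED";
  timestamp: number;
  location: { lat: number; lon: number };
};

type ActivityTransitioned = {
  type: "ACTIVITY_TRANSITIONED";
  timestamp: number;
  activity: string;               // e.g. "IN_VEHICLE" or "WALKING"
  transition: "ENTER" | "EXIT";
};

type GeofenceTriggered = {
  type: "GEOFENCE_TRIGGERED";
  timestamp: number;
  stopId: string;                 // the GTFS stop the geofence belongs to
  transition: "ENTER" | "EXIT";
};

type Event = LocationUpdated | ActivityTransitioned | GeofenceTriggered;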

The hidden states

With the help of the real-world tracking data I collected, I attempted to create a transition matrix of some sort that helped me better understand the different states a journey can be in. Of course, this can be further refined and optimized, but the following list proved to be a good start:

state | description
IDLE | The initial state of the system. Think of it as the start of the day
AT_STATION | Indicates close proximity to a specific train station
WAITING_AT_STATION | Indicates the person is waiting at a station. Could cover different scenarios, such as waiting for a train, or waiting for a friend to arrive and picking them up.
TRIP_STOP | A more specific version of AT_STATION, indicating a stop along a train trip, including arrival and departure times.
DEPARTING_ON_TRAIN | Indicates a person is leaving the station on a train, triggering a lookup in the timetable
LEAVING_STATION | When a person leaves a station, potentially on a train, or when they head home
SHAPE | Indicates a specific section of a train trip, including estimated timetable and geographic information
OFF_TRAIN | Indicates a person has stepped off a train, finishing their journey
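The code snippets further down refer to TripState / JourneyState types; a plausible sketch of such a union, with assumed fields derived from the descriptions above, could look like this:

// Hypothetical state instances; the fields are assumptions based on the table above
type AtStation = {
  type: "AT_STATION" | "WAITING_AT_STATION";
  id: string;
  stopId: string;        // GTFS stop id
  lat: number;
  lon: number;
};

type TripStop = {
  type: "TRIP_STOP";
  id: string;
  stopId: string;
  tripId: string;        // the GTFS trip this stop belongs to
  arrival: number;       // scheduled arrival, minutes since midnight
  departure: number;     // scheduled departure
  lat: number;
  lon: number;
};

type Shape = {
  type: "SHAPE";
  id: string;
  tripId: string;
  scheduled: number;     // estimated passing time from the train simulation
  delay: number;         // current delay estimate in minutes
  lat: number;
  lon: number;
};

// IDLE, DEPARTING_ON_TRAIN, LEAVING_STATION and OFF_TRAIN omitted for brevity
type JourneyState = AtStation | TripStop | Shape;
type TripState = JourneyState;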

Candidate Selection

To be precise, most of the states above are classes of hidden states, not actual instances. For example, each station in the GTFS dataset is transformed into its own AT_STATION instance, with additional metadata such as geo-coordinates and name. The same goes for TRIP_STOP, which also carries schedule information for a specific train at a specific stop. This is why the search space for the hidden states can get pretty large, growing into tens of thousands of states, and why it’s recommended to reduce the search space for each HMM iteration to only the relevant candidates.

Most of the states have geo-information attached to them, which makes reducing the search space pretty easy: only include the ones within a certain radius of the observation. This downsizing helps the matcher perform better, and also makes it easier to debug and visualize state changes.

Train simulation

Creating each of the above states was pretty straightforward with the help of GTFS data, with one exception: SHAPE. I wanted to reduce the search space not only by geo-information, but also by schedule, so I needed to find a way to attach the schedule to each SHAPE. Since a SHAPE is just a 1km segment of the whole train trip, GTFS doesn’t provide scheduling data at that level of detail. Scheduling information not only reduces the search space, but also helps narrow down the specific train. Many trains run along the same tracks, and even make similar stops, but their timing and speed might be just a little different, giving us even more clues about which train we are actually on.

To solve this, I created a simulation for each train trip based on the known stop times, and then interpolated the times in between. This time estimate was then attached to each SHAPE segment and used for probability calculation. However, this approach is highly inaccurate, since long-distance trains may not stop for hours and vary their speed based on the terrain. But given the lack of more detailed data, it was a good enough approximation to begin with. Another improvement that turned out useful was to continuously update the delay, based on the actual time of the location readings, to avoid errors compounding.
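As an illustration of that interpolation (a simplification of my own, not the actual Switch code), the time between two known stop times can be distributed linearly across the shape segments in between; the example numbers line up with the table that follows.

// Hypothetical linear interpolation of passing times between two scheduled stops.
// Times are minutes since midnight; shapeCount is the number of 1km segments in between.
function interpolateShapeTimes(departure: number, arrival: number, shapeCount: number): number[] {
  const times: number[] = [];
  const step = (arrival - departure) / (shapeCount + 1);
  for (let i = 1; i <= shapeCount; i++) {
    times.push(departure + step * i);
  }
  return times;
}

// Example: departure 11:43 (703) and arrival 11:55 (715) with 3 shapes in between
// yields estimated passing times of 706, 709 and 712 (11:46, 11:49, 11:52)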

state | scheduled | estimated | visited | delay (min) | lat | lon
TRIP_STOP_1 | 11:43 | - | 11:50 | 7 | 48.19.. | 16.33..
SHAPE_1 | - | 11:46 | - | 7 | 48.19.. | 16.32..
SHAPE_2 | - | 11:49 | 11:59 | 10 | 48.19.. | 16.32..
SHAPE_3 | - | 11:52 | - | 10 | 48.20.. | 16.32..
TRIP_STOP_2 | 11:55 | - | 12:04 | 9 | 48.20.. | 16.32..

The table above illustrates this behavior: the scheduled times are only known for the actual stops, whereas shapes only carry the time estimate. Not every shape is visited during a train ride, due to the low frequency of the location readings. But when one is, the delay is calculated from the estimated and visited times, and then propagated to all following shapes. In the case above, the delay increases from 7 to 10 minutes and is carried on to all subsequent shapes.
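A rough sketch of that delay update, using the hypothetical Shape type from earlier: whenever a shape is visited, the delay is recomputed from the difference between the visited and estimated times and written forward to the remaining shapes of the trip.

// Hypothetical delay propagation along the remaining shapes of a trip.
// `shapes` is ordered in travel direction; `visitedIndex` is the shape that was just matched.
function propagateDelay(shapes: Shape[], visitedIndex: number, visitedTime: number) {
  const visited = shapes[visitedIndex];
  const delay = visitedTime - visited.scheduled; // observed minus estimated passing time
  for (let i = visitedIndex; i < shapes.length; i++) {
    shapes[i].delay = delay;
  }
}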

This information was then put to good use for candidate selection and calculating emission probabilities. The following pseudo-code demonstrates how the candidate selection works for SHAPE states, based on the train simulation above:

function findCandidates(observation: Event) {
  const candidates: JourneyState[] = [];
  for (const candidate of allCandidates) {
    switch (candidate.type) {
      case "SHAPE": {
        const { scheduled, delay } = candidate;
        // How close is the shape segment to the location reading?
        const distance = calculateDistance(candidate, observation);
        // How far off is the delay-adjusted time estimate from the observation time?
        const timeDiff = Math.abs(scheduled + delay - observation.timestamp);
        if (distance < CANDIDATE_RADIUS && timeDiff < SCHEDULE_SIGMA) {
          candidates.push(candidate);
        }
        break;
      }
      // ... other candidate types omitted for brevity
    }
  }
  return candidates;
}

Only SHAPE elements within a certain radius and time window are considered candidates for the current observation. This considerably reduced the candidate space, improving overall performance.

Transition Probabilities

Next up was modeling the transition probabilities, defining the likelihood of one state changing into another. The following diagram seems complex at first, but once you dig in, it should be pretty self-explanatory.

To be clear, the states visible in this diagram are grouped into classes for brevity, but in reality, each station has its own state, as do the other states that are bound to a specific location. The transition probabilities simply describe the likelihood of one state transitioning into another. You can think of it as a probabilistic state machine.

Transition Probabilities

To provide more clarity on the above diagram, let’s go through a specific example. Imagine the current state is AT_STATION. There are two possible state transitions (two arrows pointing outward):

  1. A transition to WAITING_AT_STATION, but only if we remain at the same station
  2. A transition to AT_STATION, remaining at the same station

This is what it looks like in code:

function transitionProb(from: TripState, to: TripState) {
  switch (from.type) {
    case "AT_STATION": {
      switch (to.type) {
        case "AT_STATION": {
          // Remaining at the same station is allowed, jumping stations is not
          if (to.stopId === from.stopId) {
            return 1;
          }
          return 0;
        }
        case "WAITING_AT_STATION": {
          if (to.stopId === from.stopId) {
            return 1;
          }
          return 0;
        }
      }
      return 0;
    }
    // ... cases for the other state types omitted for brevity
  }
  return 0;
}

Because the transition probability is 1.0 in both cases, the chain continues with both states. To find out which state survives, we need to compare them against our observation and calculate their emission probabilities.

Emission Probabilities

After calculating the transition probabilities, the next step is to determine the likelihood of a state emitting the current observation. Let’s say we’ve transitioned from AT_STATION to WAITING_AT_STATION based on the probabilities above (1.0). We now need a way to check, based on our observation, how likely it is that we are at that station. Since each station has geo-information attached, we simply calculate the distance from our location reading to that station. The lower the distance, the higher the likelihood. By defining some lower and upper bounds, we can stabilize the system a bit and make it less prone to accuracy errors.

The following emissionProb function returns 1.0 if the current location reading is within 250m of that station, indicating a high correlation between the observation and the hidden state:

// The bigger the distance, the lower the likelihood
function stationProbability(distance: number) {
  // If the distance is within 250m (STATION_MATCH_RADIUS), always return 1
  if (distance < STATION_MATCH_RADIUS) {
    return 1;
  }
  return DISTANCE_SIGMA / (distance ** 2 + DISTANCE_SIGMA);
}

// Returns likelihood (0-1) for the candidate to emit that observation
function emissionProb(candidate: TripState, observation: Event) {
  switch (observation.type) {
    case "LOCATION_UPDATED": {
      const { location } = observation;
      switch (candidate.type) {
        case "WAITING_AT_STATION":
        case "AT_STATION": {
          const distance = calculateDistance(candidate, location);
          return stationProbability(distance);
        }
      }
    }
  }
  return 0;
}

The emission probabilities aren’t just limited to geo-distance; they also allow us to compute a similarity score against the actual train schedule. This is required to further close in on the train we are most probably on. For example, a SHAPE or TRIP_STOP will only emit a high likelihood if the proximity is close enough and we are on schedule.

With the transition and emission probabilities calculated, we can then update the backpointers, creating a link from each candidate back to its most likely previous state. And then we wait for the next observation to trigger another iteration, until the exit criterion has been reached. This is what it looks like in pseudocode:

const LOG_ZERO = -1e6; // log-space floor, used to treat a probability of zero
const backpointers: Map<string, JourneyState>[] = [];
let viterbi = new Map<string, number>();
let prevCandidates: JourneyState[] = [];

function step(observation: Event) {
  const prevViterbi = viterbi;
  viterbi = new Map<string, number>();
  const newBackpointers = new Map<string, JourneyState>();
  const candidates = findCandidates(observation);
  for (const candidate of candidates) {
    // Find the most likely previous state for this candidate
    let maxProb = -Infinity;
    let bestPrevState: JourneyState | null = null;
    for (const prev of prevCandidates) {
      const prevProb = prevViterbi.get(prev.id) ?? LOG_ZERO;
      const tProb = Math.log(transitionProb(prev, candidate));
      const totalProb = prevProb + tProb;
      if (totalProb > maxProb) {
        maxProb = totalProb;
        bestPrevState = prev;
      }
    }
    if (bestPrevState) {
      const eProb = Math.log(emissionProb(candidate, observation));
      // Skip candidates whose emission probability is effectively zero
      if (eProb >= LOG_ZERO) {
        const viterbiProb = maxProb + eProb;
        viterbi.set(candidate.id, viterbiProb);
        newBackpointers.set(candidate.id, bestPrevState);
      }
    }
  }
  prevCandidates = candidates;
  backpointers.push(newBackpointers);
  // Seeding the very first observation (e.g. from IDLE) is omitted for brevity
}

To clarify, the calculations are performed in log-space to avoid numerical underflow. Typically you would multiply each probability with the next one, but the numbers would get very small very fast. Log-space turns multiplications into additions, solving this issue.
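A quick illustration with made-up numbers:

// Multiplying many small probabilities quickly underflows towards zero...
const probs = [0.001, 0.002, 0.0005];
const product = probs.reduce((acc, p) => acc * p, 1);          // 1e-9, and shrinking fast
// ...whereas summing their logarithms stays in a comfortable numeric range
const logSum = probs.reduce((acc, p) => acc + Math.log(p), 0); // roughly -20.7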

Backtracking using Viterbi

The HMM’s purpose is to drive the state machine forward, but in order to recover the most likely path taken, we need to backtrack using the Viterbi algorithm. Since each step of our progression forward carries on the probability of the last step, all we have to do is to start backtracking from the most probable end-state.

This is best explained with the following example (pruned for brevity) where a trip starts at AT_STATION_1, travels along TRAIN_3 and exits at STATION_2.

Viterbi Backpointers

Even though there were various trains to choose from in the beginning, TRAIN_1 was discarded early because no more matching SHAPEs were found, maybe because the train was running in the opposite direction. Another train, TRAIN_2, followed the same path but may have had a different schedule and did not stop at STATION_2. The last train, TRAIN_3, kept on going until OFF_TRAIN was reached, making it the most likely path.

function backtrack(viterbi: Map<string, number>) {
  // Start from the most probable end-state and follow the backpointers
  const topState = mostProbable(viterbi);
  const resultPath: JourneyState[] = [];
  let currentState = topState;
  for (let t = backpointers.length; t >= 0; t--) {
    if (!currentState) {
      break;
    }
    resultPath.unshift({ ...currentState });
    if (t > 0) {
      // Follow the link back to the most likely previous state
      currentState = backpointers[t - 1].get(currentState.id);
    }
  }
  return resultPath;
}

The beauty of using HMMs and Viterbi is that all sorts of probabilistic modeling can be accomplished with them. For example, the first iteration of Switch was only able to detect individual trips, not entire journeys. By simply updating the transition matrix I was able to also account for connecting trains.

Journey Reporting

Once backtracking is finished, we can restore the journey based on the states and report it back to the user for a final review.

function detectJourneys(observations: Event[]) {
  for (const observation of observations) {
    step(observation);
  }
  const path: JourneyState[] = backtrack(viterbi);
  return path;
}

With the help of our GTFS data, reconstructing the actual journey is just a matter of fetching the right data.
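As a rough sketch of what that could look like (the GTFS lookup helpers here are assumed, not part of Switch’s actual code): pick the TRIP_STOP states out of the decoded path and resolve their stop and trip metadata.

// Hypothetical reconstruction of a journey summary from the decoded path.
// lookupStopName and lookupTripName are assumed GTFS lookup helpers.
declare function lookupStopName(stopId: string): string;
declare function lookupTripName(tripId: string): string;

function buildJourneyReport(path: JourneyState[]) {
  const stops = path.filter((s): s is TripStop => s.type === "TRIP_STOP");
  return stops.map((stop) => ({
    station: lookupStopName(stop.stopId),
    train: lookupTripName(stop.tripId),
    arrival: stop.arrival,
    departure: stop.departure,
  }));
}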

The current exit criterion is based on the OFF_TRAIN state, and kicks in when four consecutive OFF_TRAIN states have been reached. Although I’ve spent days and days trying to refine or even replace the current approach with a more robust solution, I haven’t found one yet. The downside of the current approach is that it can take a long time (up to an hour) after leaving the train to actually reach that end state. This is due to the low frequency of location updates and their limited accuracy.
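A minimal sketch of that criterion as described, counting consecutive OFF_TRAIN states and finishing the journey once four in a row have been seen:

// Hypothetical check for the exit criterion described above
const EXIT_THRESHOLD = 4;
let offTrainStreak = 0;

function shouldFinishJourney(mostLikelyState: { type: string }): boolean {
  offTrainStreak = mostLikelyState.type === "OFF_TRAIN" ? offTrainStreak + 1 : 0;
  return offTrainStreak >= EXIT_THRESHOLD;
}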

A possible alternative would be a NOT_ON_TRAIN state that continuously tracks the likelihood of not being on a train, as opposed to being on one. This has its own complications, since it feels a little decoupled from the train-tracking pipeline.

Once a journey is detected, it’s reported back to the user for a final review, and then confirmed or rejected with a swipe action.

Review Trip Confirm Trip

Testing & Debugging

Updating transition or emission probabilities in code isn’t the easiest thing to do, especially when coming back weeks later. I still need to find a better way of documenting a source of truth, other than in code. Additionally, finding the right probabilities heavily relies on a trial-and-error approach, making some of the numbers feel very random.

This is the reason I heavily rely on integration tests, which allow me to validate changes I make to the algorithm against real-world recordings. But because Switch stores all its data locally on the phone, this data isn’t easily accessible. ADB can only download files from the app folder when the app is in debug mode, not when running a production build.

I ended up implementing a very simple data-sharing feature that encrypts the SQLite database with a shared key and uploads it to a specific GitHub branch for later retrieval and decryption. GitHub seems like an odd choice, but I did not want to set up a CDN or file server just for uploading debug data. This may change in the future, where I imagine having a built-in backup & restore functionality, but for now it serves my purpose well enough.

Future Work

Switch isn’t a fully polished and reliable product yet, and I have been the only beta tester so far. Nevertheless, I have tracked over 10,000km in the past six months, and the math works out. The KlimaTicket is definitely worth it.

In order for Switch to gain more maturity, I would love to get some feedback from beta testers. If you have an Android phone and are interested in joining the test group, just reach out to me. Happy to chat.