The term address geocoding often refers to the process of transforming a textual representation of an address into an equivalent latitudinal and longitudinal coordinate interpretation. For example, taking an address such as:
27 E. Cota St.
Santa Barbara, CA 93101
USA
And representing it as a set of latitudinal and longitudinal coordinates, such as:
Degrees-Minutes-Seconds (DMS) Coordinates: 34°25’07.4″N, 119°41’46.4″W
Decimal Degree (DD) Coordinates: 34.418721, -119.696215
This process, sometimes referred to as forward-geocoding, may seem simple and straightforward in concept; in practice, however, it can be quite complex and at times confusing. Most address geocoding systems will take an address and attempt to return a single set of coordinates that points to the address location. Coordinate points vary in precision, and depending on a multitude of factors, a point may be accurate for the intended location or may only be representative of the general area.
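The two notations in the example above describe the same location, and converting between them is simple arithmetic. The following is a minimal sketch of a decimal degree to DMS conversion; the function name `dd_to_dms` is made up for illustration, and it uses plain ASCII quote marks for minutes and seconds rather than the typographic marks shown above.

```python
def dd_to_dms(value: float, is_latitude: bool) -> str:
    """Format a decimal degree value as degrees-minutes-seconds."""
    if is_latitude:
        hemisphere = "N" if value >= 0 else "S"
    else:
        hemisphere = "E" if value >= 0 else "W"

    # Work in whole tenths of a second so that rounding can never
    # produce an invalid reading such as 60.0 seconds.
    total_tenths = round(abs(value) * 36000)
    degrees, remainder = divmod(total_tenths, 36000)
    minutes, tenths = divmod(remainder, 600)
    seconds = tenths / 10
    return f"{degrees}°{minutes:02d}'{seconds:04.1f}\"{hemisphere}"


print(dd_to_dms(34.418721, is_latitude=True))     # 34°25'07.4"N
print(dd_to_dms(-119.696215, is_latitude=False))  # 119°41'46.4"W
```

Note that the sign of a decimal coordinate carries the hemisphere: negative longitudes are west of the prime meridian and negative latitudes are south of the equator.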
Coordinate Points and Boundaries
Points are represented by a single set of latitudinal and longitudinal (lat/lon) coordinates, and they are often displayed and stored as decimal degree (DD) coordinates instead of the more traditional degrees-minutes-seconds (DMS) coordinates. This is because DD coordinates, also commonly referred to as decimal coordinates, are often easier to work with programmatically than DMS coordinates. In some cases, you may see a point represented as a set of X and Y coordinates, where X represents the longitudinal coordinate and Y the latitudinal coordinate.
Lat/Lon Decimal Coordinates: 34.418721, -119.696215
X,Y Decimal Coordinates: -119.696215, 34.418721
While the values of the two are the same, the order is reversed. For ease of use and to avoid confusion, most geocoding systems will display each coordinate individually.
Latitude: 34.418721
Longitude: -119.696215
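Mixing up the two orderings is an easy mistake to make in code. One way to guard against it, sketched below, is to keep coordinates in a small named structure instead of a bare pair. The `LatLon` type and its methods are hypothetical names chosen for this example and do not come from any particular geocoding library.

```python
from typing import NamedTuple, Tuple


class LatLon(NamedTuple):
    latitude: float
    longitude: float

    @classmethod
    def from_xy(cls, x: float, y: float) -> "LatLon":
        # X is the longitude and Y is the latitude, so the order flips.
        return cls(latitude=y, longitude=x)

    def to_xy(self) -> Tuple[float, float]:
        return (self.longitude, self.latitude)

    def is_valid(self) -> bool:
        # Latitude spans -90..90 and longitude spans -180..180.
        return -90.0 <= self.latitude <= 90.0 and -180.0 <= self.longitude <= 180.0


point = LatLon.from_xy(-119.696215, 34.418721)
print(point.latitude, point.longitude)

# Passing X,Y values in lat/lon order is caught here only because
# -119.696215 is outside the valid latitude range.
swapped = LatLon(-119.696215, 34.418721)
print(swapped.is_valid())
```

The range check is only a partial safeguard: a swapped pair whose values both fall between -90 and 90 would still pass, which is one more reason to label each value explicitly.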
Boundaries are usually represented by multiple sets of coordinates that can be connected by lines to form a closed non-overlapping polygonal shape, where the shape represents the boundary of a feature such as a building, property, or an administrative boundary such as a city, county or state.
Boundary: POLYGON ((-119.696638 34.418639, -119.696413 34.418869, -119.696158 34.418701, -119.695917 34.418904, -119.695772 34.418809, -119.696265 34.418351, -119.696638 34.418639))
Coordinate points are often calculated rather than measured: sometimes by interpolation, sometimes by taking the average of multiple points for an area, and sometimes by calculating the center of a boundary, commonly referred to as a centroid. Let’s take a brief look at some common geocoding coordinate resolution levels to better understand how they are achieved and what they represent.
Rooftop and Property Level
When it comes to geocoding addresses, most users are ideally looking to get a set of coordinates that will point to the rooftop of the address. Rooftop level coordinates work well for single address buildings, such as a house in the suburbs. These coordinate points are often either entered by a user or generated by a process that uses machine learning algorithms to analyze satellite image data.
In the latter method, structures such as houses and office buildings are identified so that a general outline of the structure can be recreated as a boundary. From this boundary, a centroid can be calculated, which works well as a rooftop level coordinate point for single address buildings, but not so well for large multi-address buildings such as malls and apartments.
When it comes to large multi-address buildings such as commercial strip malls or industrial complexes, there usually aren’t any clear defining borders on the roofs of the structures that the AI can use to distinguish one address from another; therefore, the entirety of the building is commonly treated as one address point.
Not all coordinate points are generated using AI or user inputs. In some cases, property boundary data that was gathered by the local municipality may be available. This property boundary data can vary and may often be used to describe a plot, lot, tract or parcel of land. The boundary data can be used to calculate a single centroid point, which in turn can be used to describe the entire property.
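As a rough illustration of the centroid calculation, the sketch below applies the standard area-weighted (shoelace) formula to the example boundary shown earlier. It treats longitude and latitude as flat X and Y values, which is an assumption that is reasonable only for small features such as a single property. The function name `polygon_centroid` is made up for this example.

```python
from typing import List, Tuple

# Example boundary from earlier, as (x, y) = (longitude, latitude) pairs,
# with the first vertex repeated at the end to close the ring.
PROPERTY_RING: List[Tuple[float, float]] = [
    (-119.696638, 34.418639),
    (-119.696413, 34.418869),
    (-119.696158, 34.418701),
    (-119.695917, 34.418904),
    (-119.695772, 34.418809),
    (-119.696265, 34.418351),
    (-119.696638, 34.418639),
]


def polygon_centroid(ring: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Area-weighted centroid of a simple (non-self-intersecting) polygon."""
    if ring[0] == ring[-1]:
        ring = ring[:-1]  # drop the repeated closing vertex
    # Shift to a local origin to limit floating point cancellation.
    x0, y0 = ring[0]
    pts = [(x - x0, y - y0) for x, y in ring]
    twice_area = cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    if twice_area == 0:
        raise ValueError("polygon has zero area")
    return (x0 + cx / (3 * twice_area), y0 + cy / (3 * twice_area))


lon, lat = polygon_centroid(PROPERTY_RING)
print(f"Latitude: {lat:.6f}")
print(f"Longitude: {lon:.6f}")
```

For an irregular or concave boundary, the centroid is not guaranteed to fall inside the shape, let alone on a building, so it should be read as a summary of the property rather than a location within it.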
Properties may contain multiple addresses and buildings, and depending on the size of the property, a centroid may not adequately describe the address or addresses residing in the property. For example, a large estate or farm or industrial factory may be one large property and a coordinate point calculated from a boundary may not be anywhere near an actual building. While this is unfortunate, sometimes the simple reality is that higher quality data is not always readily available. This is often the case for rural areas such as farmlands.
The image above is an example of a residential neighborhood where the boundaries of the houses have been identified.
In the Google Maps image above we have an outlet mall with many clothing stores, but based on the rooftop image alone, there is no easy way for an AI to determine how many stores or units there are. The best it can do is simply outline the buildings, as shown in the next image.
Thoroughfare Levels
A thoroughfare is a transportation route between one location and another. On land, it is more commonly referred to as a type of road or route that is typically used by motorized vehicles, such as a street, avenue or highway. When an address geocoding system is unable to find an address match it may sometimes fall back to thoroughfare level data if it is available.
Thoroughfare data often consists of coordinate lines that closely follow the path of the road they are representing, just like a street map without a satellite image background. The data will often contain the street name or some form of identifier, and in some cases, it may contain additional data like a block range. With block range data a geocoding system may estimate where an address might be located using address interpolation.
Interpolated Address
A block range is commonly used to describe the address numbers that a block may contain. For example, let’s say we have a block in a residential area and its block range is 101 to 199, odd numbers only. The other side of the street has a block range of 100 to 198, even numbers only. The even block range is evenly divided into five suburban houses with address numbers 100, 120, 140, 160 and 180. If rooftop and property-level data are unavailable but street and block range data are available, then a geocoding system may still be able to return accurate coordinates for each address using interpolation. Simply put, address interpolation is a method of estimating where an unknown address point may lie within a known range.
Now let’s look at the odd-numbered block range on the other side of the street. This block is divided into three parts: two suburban sized houses and one apartment complex. The two suburban houses are assigned address numbers 101 and 105 and the apartment complex is assigned address number 109. In this scenario, address interpolation will fail to return an accurate coordinate point for each of these addresses. Because the address range is large and the properties dividing the block are disproportionately sized, the coordinates for address numbers 101, 105 and 109 would all likely fall close to the first suburban house. Without any other data to work with, address interpolation assumes that addresses are distributed equally and linearly across the range. This is also why address interpolation does not work well in cases such as rural roads and highways.
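The two scenarios above can be sketched in a few lines of code. This is a simplified illustration, not how any particular geocoder implements interpolation: it places an address on a straight line between the two ends of a block, and the block end coordinates used here are hypothetical values invented for the example.

```python
from typing import Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def range_fraction(number: int, low: int, high: int) -> float:
    """How far along the block range an address number sits, from 0.0 to 1.0."""
    if not low <= number <= high:
        raise ValueError(f"{number} is outside the block range {low}-{high}")
    if high == low:
        return 0.5  # single-number range: use the middle of the block
    return (number - low) / (high - low)


def interpolate_address(number: int, low: int, high: int,
                        start: Point, end: Point) -> Point:
    """Estimate a point by assuming an equal, linear spread of addresses."""
    t = range_fraction(number, low, high)
    return (start[0] + t * (end[0] - start[0]),
            start[1] + t * (end[1] - start[1]))


# Hypothetical coordinates for the two ends of the block.
BLOCK_START: Point = (34.4180, -119.6970)
BLOCK_END: Point = (34.4190, -119.6960)

# Even side: five evenly spaced houses spread out along the block.
for number in (100, 120, 140, 160, 180):
    print(number, interpolate_address(number, 100, 198, BLOCK_START, BLOCK_END))

# Odd side: 101, 105 and 109 all land within the first tenth of the block,
# even though the three properties together occupy its full length.
for number in (101, 105, 109):
    print(number, round(range_fraction(number, 101, 199), 3))
```

Real systems typically also offset the point to the correct side of the street and follow the curve of the road; both refinements are left out here to keep the sketch short.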
Street Segments
Unfortunately, block range data is not always available, and sometimes a block range that matches the given address number cannot be found. When an address range is unavailable, all that a geocoding system may have to work with is a street name. Streets are commonly broken into line segments with start and end coordinate points. In some cases a centroid point is also available as supplementary data; if it is not, one can be calculated. If one or more street segments are found, a geocoding system may choose to return a point for one of those segments.
If the overall length of the street is small, then returning a coordinate for a point in this street could potentially help a user get close to their intended destination. However, if the street is quite long then returning just any street segment coordinate point could put one far from their intended destination. For example, Rosecrans Avenue in California runs almost 30 miles in length and spans across two counties, Los Angeles County and Orange County. Returning a coordinate point for one of its segments or even a calculated centroid may not be helpful to the end-user. However, there are some use cases where a user may not necessarily be interested in where a street is exactly located and instead is looking to see if the street simply exists. Geocoding systems may be overkill for such a need, but since they often contain plenty of street data, they are sometimes the easiest tool for these types of users to work with.
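One way a system could weigh whether a street level point is useful is to measure how long the street is before returning anything. The sketch below computes the length of a street from its segment points using the haversine formula and finds the point halfway along it. The function names are hypothetical, and the mean Earth radius constant is an approximation that treats the Earth as a sphere.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)
EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, spherical approximation


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def street_midpoint(polyline: List[Point]) -> Tuple[Point, float]:
    """Return the point halfway along a street and the street's total length."""
    pairs = list(zip(polyline, polyline[1:]))
    lengths = [haversine_km(p, q) for p, q in pairs]
    total = sum(lengths)
    remaining = total / 2
    for (p, q), length in zip(pairs, lengths):
        if length > 0 and remaining <= length:
            t = remaining / length  # position within this segment
            return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])), total
        remaining -= length
    return polyline[-1], total


# A made-up street running along the equator in two segments.
midpoint, length_km = street_midpoint([(0.0, 0.0), (0.0, 1.0), (0.0, 3.0)])
print(midpoint, round(length_km, 1))
```

A caller could compare the total length against a threshold of its own choosing and decline to return a street level point when the street is too long for the midpoint to be meaningful.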
Locality and Administrative Area Levels
Administrative areas (admin areas) are the regions into which a country is divided. Each region typically has a defined boundary with an administration that performs some level of government functions. Various administrative levels exist, and these areas are commonly expected to manage themselves with a certain level of autonomy.
The US, for example, is made up of states (first level admin area), which are divided into counties (second level admin area) that consist of municipalities (third level admin area). For comparison, the United Kingdom (GB) comprises the four countries of England, Scotland, Wales and Northern Ireland (first level admin area). These countries are made up of counties, districts, and shires (second level admin areas), which in turn are made up of cities and towns (third level admin area) and small villages and parishes (fourth level admin area).
Localities are population clusters that are commonly recognized as cities, towns, and villages. Localities and sub-localities are often considered lower-level admin areas in that they reside towards the bottom end of the admin area hierarchy, if they appear in it at all. While a locality may technically be an admin area, it is usually a higher, first or second level admin area that gets associated with an address as its admin area. Take the following US address, for example:
27 E. Cota St.
Santa Barbara, CA 93101
The state of California (CA) would be assigned as the admin area, while the city of Santa Barbara would be assigned as the locality. However, suppose that someone was looking for a particular neighborhood in a city instead of an address. Perhaps they want to go sightseeing or visit a place of interest, and they enter the following search:
Funk Zone, Santa Barbara
Some data sets are organized in simple parent-child relationships without information describing whether an entry is a locality or an admin area. A geocoding system may therefore end up returning a centroid coordinate of 34.41506, -119.69010 for the Funk Zone boundary, with locality returning a value of ‘Funk Zone’ and admin area returning ‘Santa Barbara’. Technically this would not be incorrect, but it is inconsistent with the address example, where Santa Barbara was the locality, and that kind of inconsistency may prove problematic for integrations. For some users, the descriptive data that accompanies a lat/lon coordinate point is just as important as the coordinate itself, or even more so. However, additional data is not always available or distinguishable as the scope changes. Jumping from one level of data to another means that some supplemental data that was available in one set may not be available in another, and linking the two is not always possible.
Postal Code Level
Postal codes can represent an area as large as an entire region or a place as small as a single building. In the US, a 5-digit ZIP code may cover addresses residing in several small towns, but a complete ZIP+4 code could cover addresses in a single block or block group. In the UK, a postcode may represent a group of addresses on a street or on part of a street, a group of premises, or a single premises. If a postal code has a lat/lon coordinate point associated with it, then that point should reflect the resolution of the postal code it is associated with. In the US, for example, a lat/lon coordinate for a 5-digit ZIP code would be considered low in resolution because the area it covers is generally large. A UK postcode, on the other hand, could be considered high in resolution since it often represents a single side of a street or even a single premises.
Putting it all together
With so many different types of data coming from different sources, getting them all consolidated can be a real challenge. Names often don’t match between different data sets. Some may use abbreviations, alternate names or simply IDs; some may have their names parsed or organized differently. All of this adds to the difficulty of successfully and accurately geocoding a place or address to a set of lat/lon coordinates.
Traditional geocoding systems may use a Geographic Information System (GIS) data-first approach, while others may choose to go with a full-text search approach. Each approach has its own set of strengths and weaknesses. Some weaknesses, when acknowledged, can be managed and minimized to a degree, but not always, which is why it is sometimes safer for a geocoding system to return no result than to return a potentially highly inaccurate one. We’ve all heard stories of people blindly following their GPS directions and driving into a lake or forest or worse.
In order to get the best out of a geocoding system, it is important to know what level of data you are starting off with and what level of data you can get from it. Low-quality address data will often lead to low-quality results with coordinates of lower resolution such as locality or admin area. Geocoding services can work quite well when paired with an address validation service, but it is not uncommon for mailing addresses to differ from physical addresses. This can sometimes lead to a match not being found or worse yet an inaccurate match. For some users, it may be preferable to return a lower resolution level coordinate match than a higher one if the higher one is inaccurate. So be sure to thoroughly test a geocoder service and choose one that not only fits your needs but is flexible to meet your future needs.
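To make the idea of resolution levels concrete, the sketch below ranks candidate matches by the levels discussed in this article and lets the caller set the lowest level it is willing to accept. The names `Resolution`, `Candidate` and `best_match` are hypothetical and are not taken from any geocoding service. The strict ordering is also a simplifying assumption: as noted above, a postal code can be coarser or finer than a street depending on the country.

```python
from enum import IntEnum
from typing import NamedTuple, Optional, Sequence


class Resolution(IntEnum):
    # Ordered from coarsest to finest; the ordering is an assumption.
    ADMIN_AREA = 1
    LOCALITY = 2
    POSTAL_CODE = 3
    STREET = 4
    INTERPOLATED = 5
    PROPERTY = 6
    ROOFTOP = 7


class Candidate(NamedTuple):
    latitude: float
    longitude: float
    resolution: Resolution


def best_match(candidates: Sequence[Candidate],
               minimum: Resolution) -> Optional[Candidate]:
    """Return the finest candidate at or above the minimum level, else None."""
    acceptable = [c for c in candidates if c.resolution >= minimum]
    return max(acceptable, key=lambda c: c.resolution, default=None)


candidates = [
    Candidate(34.41506, -119.69010, Resolution.LOCALITY),
    Candidate(34.418721, -119.696215, Resolution.ROOFTOP),
]
print(best_match(candidates, Resolution.STREET))

# With only a locality level match available, a caller that needs street
# level accuracy gets no result rather than a misleading point.
print(best_match(candidates[:1], Resolution.STREET))
```

Returning the resolution level alongside the coordinates lets an integration decide for itself whether a coarse match is good enough for its use case.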