Project 1: San Diego Collision EDA
Why it matters?
In Korea, car navigation systems notify drivers when they enter ‘accident-prone areas,’ helping them stay alert and avoid potential accidents. However, similar services are not commonly available in navigation systems in the US. As a result, I became curious about identifying accident-prone areas in San Diego to raise awareness when I drive in the future.
Moreover, as an inexperienced driver, I have been told lane changes are the most challenging task when driving in Korea, partly due to narrow lanes. With the wider lanes in the US, I am interested in investigating whether lane changes are still the most challenging task or if other actions demand greater caution.
Overview
- Used Traffic Collisions - basic reports and Police Beats from City of San Diego Open Data Portal and grid map of San Diego from SanGis
- Applied geocoding to unique addresses using Python and Google Geocoding API and stored them in SQLite database
- Created interactive visualization with Tableau so that viewers can look at specific locations in San Diego County
Data Collection
Original dataset
1. Traffic Collisions - basic reports
The following are sample rows from the dataset that contains records from 2015 January to 2023 May. It’s downloaded from the City of San Diego Open Data Portal.
report_id | date_time | police_beat | address_no_primary | address_pd_primary | address_road_primary | address_sfx_primary | violation_section | violation_type | charge_desc | injured | killed | hit_run_lvl |
---|---|---|---|---|---|---|---|---|---|---|---|---|
171111 | 2015-01-14 20:00:00 | 835 | 4200 | JUNIPER | STREET | MISC-HAZ | VC | MISCELLANEOUS HAZARDOUS VIOLATIONS OF THE VEHICLE CODE | 0 | 0 | MISDEMEANOR | |
192016 | 2015-03-19 12:00:00 | 622 | 5200 | LINDA VISTA | ROAD | MISC-HAZ | VC | MISCELLANEOUS HAZARDOUS VIOLATIONS OF THE VEHICLE CODE | 0 | 0 | MISDEMEANOR | |
190012 | 2015-03-24 03:05:00 | 626 | 1000 | W | WASHINGTON | STREET | 22107 | VC | TURNING MOVEMENTS AND REQUIRED SIGNALS | 2 | 0 | nan |
191866 | 2015-03-27 23:56:00 | 613 | 2800 | WORDEN | STREET | 22107 | VC | TURNING MOVEMENTS AND REQUIRED SIGNALS | 1 | 0 | nan | |
185207 | 2015-07-06 11:45:00 | 813 | 2800 | EL CAJON | BOULEVARD | 20002(A) | VC | HIT AND RUN | 0 | 0 | MISDEMEANOR |
2. Police Beats
The following are the mapping of police_beat values from the above dataset into neighbor names in San Diego. It also contains the shp files of each neighbor.
beat | neighborhood |
---|---|
111 | Clairemont Mesa East |
112 | Clairemont Mesa West |
113 | Bay Ho |
114 | North Clairemont |
115 | University City |

3. Grid Map of San Diego
The following is a sub-grid of the Regional Public Safety Geodatabase’s Public Safety Grid. Each grid is about 0.3 miles * 0.6 miles (width * height).

Geocoding
Used this script to geocode addresses using Google Geocoding API and store in a SQLite database.

Data Cleaning
1. charge_desc Mapping
Because charge_desc (charge description) values in the Traffic Collisions - basic reports are hand recorded by different police officers, different abbreviations and wordings are used throughout the rows. Thus, I needed to group them into higher-level categories. Below is a sample mapping.
original_cause | abbreviate_cause |
---|---|
DRVG WITHOUT VALID DRVR’S LIC (M) | without valid license |
TURNING MOVEMENTS AND REQUIRED SIGNALS | unsafe turn |
HIT AND RUN RESULTING IN DEATH OR INJURY(IBR 90Z) | hit and run |
PEDESTRIAN NOT TO SUDDENLY ENTER PATH, ETC | pedestrian |
FAIL TO STOP AT LIMIT LINE AT RR CROSSING (I) | stop violation |
2. Geocoding Cleaning
Google Geocoding API was unable to locate some addresses due to incorrect addresses and interpret Spanish abbreviations such as CAM for Camino and CMT for Caminito. Thus, I needed to confirm those by hand.
3. Shape File Cleaning
Since I’m only interested in collisions in San Diego, I used the shp file from Police Beats to only extract grids that overlap with polygons from Police Beats.
# using intersects() from geopandas
extract_necessary_grids = [np.any(police_beat['geometry'].intersects(g))
for g in sd_grid['geometry']]
sd_grid_trim = sd_grid[extract_necessary_grids]
Visualization
