A Perspective of Property Value and Police Proximity to Crime Rates at the San Diego Blue Line Trolley¶
Permissions¶
Place an X in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)
- [X] YES - make available
- [ ] NO - keep private
Names¶
- Vuong Bui
- Emily Chen
- William Heng
- Sharon Ni
- Rina Pecherskaya
Abstract¶
Our study delves into the crime patterns observed at San Diego MTS Blue Line Trolley stops, a public transportation system often used by San Diego residents, especially UCSD student commuters. Specifically, we aim to examine the difference in crime rates between trolley stops located within and outside of La Jolla, so our analysis classifies trolley stops into two regions: La Jolla and non-La Jolla. To test our hypothesis, we use a permutation test to examine whether the observed difference between the two regions is statistically significant.
Additionally, to understand potential factors that correlate with differences in crime rates, we will also explore the trend of mean home values and police station proximity and their relationship to crime rates. These insights can help us understand the socioeconomic dynamics and accessibility to social services that may influence crime patterns within the vicinity of trolley stops. Our findings indicate a potential correlation with home values and a weak correlation with proximity to police stations. These results are all represented and supported by our plots. We will also conduct exploratory data analysis to examine the most common crime types at trolley stops, providing a comprehensive overview of crimes characteristic of trolley stops.
Furthermore, our exploratory data analysis indicates a possible relationship between crime rates and location. Initially, we studied relationships based on zip codes, but we later found that zip codes reflect more than location, including research, business, and other uses. Since zip codes depend on a multitude of factors, we used neighborhoods as our location proxy instead, since neighborhoods are communities related primarily by distance. We then performed a short geospatial analysis, which showed an inverse relationship between property values and crime incidents across neighborhoods. Based on our findings, it is quite possible that neighborhoods with higher property values have lower crime incident rates.
These insights potentially underscore the need to enhance safety across the trolley system.
Research Question¶
Do La Jolla trolley stops on the Blue Line experience higher rates of criminal incidents compared to non-La Jolla trolley stops within San Diego? If so, do property value and proximity to police stations correlate with these patterns?
Background and Prior Work¶
First and foremost, the MTS Trolley service is broken down into the Blue, Orange, and Green Lines. To narrow our research, we will focus on stops associated with the Blue Line. We found that most of La Jolla uses the ZIP code 92037. Since UCSD is in La Jolla and uses 92093, and 92092 for its PO Boxes, we will also need those ZIP codes. We then define "La Jolla trolley stops" as stops in any of the three listed ZIP codes: 92037, 92093, and 92092. For context, there are 32 Blue Line trolley stops. Of those, three are considered La Jolla trolley stops: UC San Diego Health La Jolla, UC San Diego Central Campus, and Nobel Drive. More details about the Blue Line trolley stops can be found here.
Research on the correlation between public transportation and crime rates has been done in the past, indicating varied results based on location and socioeconomic factors. Studies have shown that areas surrounding transit stops can experience higher crime rates due to a multitude of factors ranging from socioeconomic aspects, such as affluence and average education level, to physical aspects, such as increased pedestrian traffic and anonymity for offenders. For our research focus, we will dive into the socioeconomic factor of affluence and the physical factor of proximity to police stations.
One keystone study that informs our research is by Liggett, Loukaitou-Sideris, and Iseki (2003) [1], who found that certain Los Angeles Metro stations had higher crime rates, attributed to social and physical characteristics of the stations and their neighborhoods. Methodologies used in these studies often involve spatial analysis using regression analysis and statistical models to correlate crime data with socioeconomic variables. By applying similar methodologies to our study, we can provide insights into whether La Jolla trolley stops, situated in wealthier areas, experience different crime patterns compared to stops in less affluent parts of San Diego (non-La Jolla trolley stops).
Another effective way to conduct our research and better understand the dynamics is to include crime trends over time. Prior research, such as that by Block and Block (2000) [2], emphasizes the importance of temporal patterns in crime analysis, highlighting how crime rates can fluctuate based on factors like economic conditions and proximity to law enforcement. They did this by using Geographic Information Systems (GIS) to map crime incidents and analyze the spatial patterns. Additionally, other studies have explored the impact of property values and proximity to police stations on crime rates. For example, the study by Taylor (1995) [3] suggests that higher property values might deter crime due to better security measures, and that closer proximity to police stations can enhance surveillance, which may also reduce crime rates.
Incorporating these variables into our analysis and studies will help determine if affluence and proximity to police stations contribute to the observed and predicted crime patterns at La Jolla trolley stops compared to Non-La Jolla trolley stops.
References:
- ^ Liggett, R., Loukaitou-Sideris, A., & Iseki, H. (2003). Journeys to Crime: Assessing the Effects of a Light Rail Line on Crime in the Neighborhoods. Journal of Urban Affairs, 25(2), 165-184.
- ^ Block, R., & Block, C. (2000). The Bronx and Chicago: Street Robbery in the Environs of Rapid Transit Stations. Journal of Transportation and Statistics, 3(3), 29-36.
- ^ Taylor, R. B. (1995). The Impact of Crime on Communities. The Annals of the American Academy of Political and Social Science, 539(1), 28-45.
Hypothesis¶
We believe that there are lower levels of criminal incidents at La Jolla trolley stops on the Blue Line compared to non-La Jolla trolley stops. This is due to factors including higher property values and closer proximity to police stations which can potentially lead to better security and access to social services.
Main hypothesis: We believe there are lower levels of criminal incidents at La Jolla trolley stops on the Blue Line compared to non-La Jolla trolley stops within San Diego.
Null hypothesis: There is no significant difference in the rates of criminal incidents between La Jolla trolley stops on the Blue Line and non-La Jolla trolley stops within San Diego.
Alternate hypothesis: La Jolla trolley stops on the Blue Line experience lower rates of criminal incidents compared to non-La Jolla trolley stops within San Diego.
The significance level is 5%.
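The one-sided permutation test implied by these hypotheses can be sketched as follows. This is a minimal illustration, assuming the test statistic is the difference in mean incident counts per stop (La Jolla minus non-La Jolla). The function name `permutation_p_value` and the example counts are hypothetical, made up only to show the mechanics, and are not our data or results.

```python
import numpy as np

def permutation_p_value(la_jolla_counts, other_counts, n_permutations=10_000, seed=0):
    """One-sided permutation test: is the La Jolla mean lower than the other mean?"""
    rng = np.random.default_rng(seed)
    la_jolla_counts = np.asarray(la_jolla_counts, dtype=float)
    other_counts = np.asarray(other_counts, dtype=float)
    observed = la_jolla_counts.mean() - other_counts.mean()

    # Under the null hypothesis the region labels are exchangeable,
    # so we shuffle the pooled counts and recompute the statistic.
    pooled = np.concatenate([la_jolla_counts, other_counts])
    n_la_jolla = len(la_jolla_counts)
    simulated = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(pooled)
        simulated[i] = shuffled[:n_la_jolla].mean() - shuffled[n_la_jolla:].mean()

    # Share of shuffles at least as low as the observed difference.
    p_value = np.mean(simulated <= observed)
    return observed, p_value

# Hypothetical incident counts per stop, for illustration only.
example_la_jolla = [12, 30, 25]
example_other = [410, 95, 620, 150, 880, 240, 330]
observed_diff, p_value = permutation_p_value(example_la_jolla, example_other)
print(observed_diff, p_value, p_value < 0.05)
```

With a 5% significance level, we would reject the null hypothesis when the returned p-value falls below 0.05.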
Data¶
The ideal dataset for answering this question would cover crime in San Diego. Its variables should include the crime description, the time the crime was reported, and the location of the crime. Because the MTS Trolley Blue Line extension of nine stops opened on November 21, 2021, we will filter the datasets to this timeline. The observations should extend to the most recent year possible.
Our group will collect data through verifiable and reputable public domains. Another data source that we will turn to is research papers and articles written by credible experts.
In addition to ensuring our data sources are reliable, we will also perform ethical evaluations on our data. This includes but is not limited to studying if transparency, privacy, and consent are respected throughout the process of obtaining, processing, and publishing the data.
The data will be stored in the repository. For large data files, we will decide which parts are needed, explain that choice, and keep only the data that are useful to us in order to reduce the file size. Otherwise, we can access the data locally and update the analyses remotely. In either case, we will be transparent about our data analysis process.
One real dataset that we could use is the San Diego Police NIBRS Crime Offense dataset, which records crimes in San Diego. It is useful because it pairs each offense with its geographical location, which lets us compare crime rates at La Jolla trolley stops versus non-La Jolla trolley stops. It is also obtainable and does not differ too much from the ideal. This data can be found here, and there are 359564 observations with 31 variables.
We can also look into property prices in San Diego. According to our background and prior works, Brantingham and Wong inferred that the Mid-Coast Trolley station could possibly lead to more crime because it standardized spatial and temporal characteristics of the area. They surmised that it could provide perpetrators with access to affluent neighborhoods. We hope that by studying property prices around the La Jolla and non-La Jolla trolley stations, we can gain insight into whether property value is a statistically significant variable that can influence crime rates. This data is also obtainable through Zillow Housing Data. It can be found here, and there are 21558 observations with 299 variables.
Moreover, we want to investigate how police stations’ proximity to trolley stations may provide insights into crime patterns. Since the number of police stations is about 30, we can convert this information into a CSV file. A summary of these police stations can be found here, and a more detailed breakdown of the police station division from the government can be found here.
Data overview¶
For each raw dataset we have
- Dataset #1
  - Dataset Name: sd_crimes_report.csv
  - Link to the dataset: crimes fp
  - Number of observations: 359564
  - Number of variables: 31
- Dataset #2
  - Dataset Name: sd_home_values.csv
  - Link to the dataset: homes fp
  - Number of observations: 21558
  - Number of variables: 299
- Dataset #3
  - Dataset Name: sd_police_stations.csv
  - Link to the dataset: Dataset Folder
  - Number of observations: 30
  - Number of variables: 8
For the San Diego Police NIBRS Crime Offenses dataset, some important variables could be nibrs_uniq, occured_on, day_of_week, month, year, division, ibr_category, code_section, city, zip, latitude, and longitude. The metrics and datatypes range from integers to strings and floats. The data likely needs to be cleaned because there appear to be typos and inconsistencies throughout. Chula Vista is a crucial location because the MTS Trolley Blue Line goes through it, but the police NIBRS data does not cover this area. Ideally, to prevent geographical bias, we will need to find a dataset with crime patterns that include Chula Vista.
For the San Diego police station dataset, some important variables could be name, address, and zipcode. The metrics and datatypes range from integers to strings. We want to locate the police stations in San Diego so that we can explore a potential relationship between police station proximity and crime rates around the La Jolla and non-La Jolla trolley stops. Minimal cleaning and preprocessing will be needed, and we will extract the important variables from this dataset.
For the Zillow home values dataset, some important variables could be RegionName, RegionType, State, Metro, CountyName, and the home values for the range of dates from 2021 to 2024. Similar to dataset 1, the metrics and datatypes range from integers to strings and floats: numerical values are represented as integers or floats, while descriptions are represented as strings. We want to see the property values around the La Jolla versus non-La Jolla trolley stops so that we can explore a potential relationship between property values and crime rates around these stops. Minimal cleaning and preprocessing will be needed, and we will extract the important variables from this dataset.
We plan to combine these datasets by location, perhaps by ZIP code, to focus on property values and crime rates around La Jolla versus non-La Jolla trolley stops and potentially draw a relationship.
For each cleaned dataset we have
- Dataset #1
  - Dataset Name: cleaned_sd_crimes_report.csv
  - Link to the dataset: Dataset Folder
  - Number of observations: 14489
  - Number of variables: 14
- Dataset #2
  - Dataset Name: cleaned_sd_home_values.csv
  - Link to the dataset: Dataset Folder
  - Number of observations: 196
  - Number of variables: 37
# please install prior to running the notebook
# !pip install geopandas
# !pip install matplotlib
# !pip install missingno
# !pip install seaborn
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import Point
from math import radians, sin, cos, sqrt, atan2
import seaborn as sns
Dataset #1 San Diego Police NIBRS Crime Offenses¶
This dataset is obtained from the City of San Diego website and comprises Crime Offense Data starting from 2021. It is extracted from the San Diego Police Department’s Records Management System. This dataset will primarily help us measure the rates of criminal activity surrounding various Blue Line trolley stops. The key variables we are interested in for this dataset include: nibrs_uniq, occurred_on, day_of_week, code_section, neighborhood, block_addr, city, state, zip, latitude, longitude.
crimes_fp = 'sd_crimes_report.csv'
crimes_report_raw = pd.read_csv(crimes_fp)
crimes_report_raw.head()
/var/folders/8n/ff8_52m15k542z26_73z43x40000gn/T/ipykernel_12080/1687118722.py:2: DtypeWarning: Columns (2) have mixed types. Specify dtype option on import or set low_memory=False. crimes_report_raw = pd.read_csv(crimes_fp)
| | objectid | nibrs_uniq | case_number | occured_on | approved_on | day_of_week | month | year | code_section | group_type | ... | division | block_addr | city | state | zip | query_run_date | geocode_status | geocode_score | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2893344_90Z | 21032576 | 2021-07-11 20:45:00 | 2021-07-11 23:13:45 | 1 | 7 | 2021 | 148 (A)(1) PC OBSTRUCT/RESIST PEACE OFCR/EMER ... | B | ... | Central | 100 31st ST | SAN DIEGO | CA | 92102.0 | 2023-09-11 08:41:49.000 | M | 96.30 | 32.706474 | -117.127455 |
| 1 | 2 | 2959571_23F | 22701933 | 2022-02-11 22:00:00 | 2022-02-14 20:38:50 | 6 | 2 | 2022 | 459 PC BURGLARY (VEHICLE) (F) || | A | ... | Northeastern | 10000 MAYA LINDA ROAD | SAN DIEGO | CA | 92126.0 | 2023-09-11 08:41:49.000 | M | 97.22 | 32.901073 | -117.120120 |
| 2 | 3 | 2872072_13B_2 | 21020993 | 2021-05-02 06:00:00 | 2021-05-06 19:12:50 | 1 | 5 | 2021 | 273.5 (A) PC SPOUSAL/COHABITANT ABUSE WITH MIN... | A | ... | Western | 3800 Greenwood ST | SAN DIEGO | CA | 92110.0 | 2023-09-11 08:41:49.000 | M | 96.97 | 32.754899 | -117.206022 |
| 3 | 4 | 3026521_13A_2 | 22041849 | 2022-09-21 12:46:21 | 2022-11-03 14:32:03 | 4 | 9 | 2022 | 273 A (A) PC WILLFUL CRUELTY TO CHILD: WITH IN... | A | ... | Unknown | NaN | Mecca | CA | 92254.0 | 2024-04-26 12:03:19.800 | U | 0.00 | NaN | NaN |
| 4 | 5 | 2836797_280 | 21001415 | 2021-01-09 11:23:00 | 2021-01-09 22:09:28 | 7 | 1 | 2021 | 10851 (A) VC OTHER AGENCY VEHICLE THEFT/RECOVE... | A | ... | Southern | 1100 Walnut AVE | CHULA VISTA | CA | 91911.0 | 2023-09-11 08:41:49.000 | M | 95.45 | 32.605965 | -117.088752 |
5 rows × 31 columns
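The DtypeWarning above points at column index 2, which is `case_number` in the header shown. One way to avoid the warning, sketched here under the assumption that case numbers should be treated as text, is to pass an explicit `dtype` to `pd.read_csv`. The inline CSV is a made-up stand-in for `sd_crimes_report.csv`, and the value `A22701933` is invented to mimic a mixed-type column.

```python
import io
import pandas as pd

# Made-up stand-in for sd_crimes_report.csv with a mixed-type case_number column.
sample_csv = io.StringIO(
    "objectid,nibrs_uniq,case_number,occured_on\n"
    "1,2893344_90Z,21032576,2021-07-11 20:45:00\n"
    "2,2959571_23F,A22701933,2022-02-11 22:00:00\n"
)

# Reading case_number as a string keeps one consistent type for the column.
sample = pd.read_csv(sample_csv, dtype={"case_number": str})
print(sample["case_number"].tolist())
```

Since `case_number` is dropped during cleaning, the warning does not affect the analysis; the explicit dtype only makes the load step quieter and more predictable.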
Data cleaning to make our descriptive and inferential analyses as accurate as possible:¶
- Filtered to include only criminal offenses that occurred on or after the opening of the Blue Line trolley stop at the UCSD campus (November 21, 2021).
- Changed columns that display time information to pandas datetime objects to help group and query the data.
- Addressed the typos, especially in the 'city' column.
- Dropped unnecessary columns.
- Filtered to criminal activities within 0.25 km of a Blue Line Trolley stop for precision.
# of Observations After Cleaning: 14679
# of Variables After Cleaning: 11 original columns + 3 (min_distance: distance in kilometers from the location of the crime to the closest trolley stop, closest_stop: the name of the closest trolley stop, la_jolla: boolean indicating whether the stop is categorized as a 'La Jolla trolley stop')
crimes_report = crimes_report_raw.rename(columns={'occured_on': 'occurred_on'})
crimes_report['occurred_on'] = pd.to_datetime(crimes_report['occurred_on'])
crimes_report = crimes_report.loc[crimes_report['occurred_on'] >= pd.to_datetime('2021-11-21')]
crimes_report['city'] = (crimes_report['city'].str.upper()
.replace(['SANDIEGO', 'SAN DEIGO', 'SAN DIEGO, CA', 'SAN DIEGO, SAN DIEGO','SAN DIEGO, CA, USA',
'SD', 'SAN D', 'SAN DIEGO', 'SAD DIEGO', 'SAN D107IEGO', 'SD92129', 'PSAN DIEGO', 'SAB DIEGO',
'SAN DIEGO,', 'SAN DIEGO CA', 'SAN CIEGO', 'SAN DIGO', 'SAN DIEGO CALIFORNIA', 'SAN DIEG',
'SAN DIGEO', 'BSAN DIEGO', 'SA DIEGO'], 'SAN DIEGO')
.replace(['SAN YSIDRO (SB)', 'SAN YSIDRO C', 'SAN YSIRDO', 'SAN YSDIRO', 'SAN YSIDRO'], 'SAN YSIDRO')
.replace(['LA. JOLLA', 'LA JOLA'], 'LA JOLLA'))
col_drop = ['objectid', 'case_number', 'approved_on', 'month', 'year', 'group_type', 'ibr_category', 'crime_against',
'ibr_offense', 'ibr_offense_description', 'pd_offense_category', 'violent_crime', 'property_crime', 'service_area',
'division_number', 'division', 'query_run_date', 'geocode_score', 'geocode_status', 'beat']
crimes_report = crimes_report.drop(columns=col_drop)
crimes_report = crimes_report.dropna(subset=['latitude', 'longitude'], how='any')
crimes_report.head()
| | nibrs_uniq | occurred_on | day_of_week | code_section | neighborhood | block_addr | city | state | zip | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2959571_23F | 2022-02-11 22:00:00 | 6 | 459 PC BURGLARY (VEHICLE) (F) || | Mira Mesa | 10000 MAYA LINDA ROAD | SAN DIEGO | CA | 92126.0 | 32.901073 | -117.120120 |
| 5 | 3074283_240_1 | 2023-03-04 01:04:00 | 7 | 10851 (A) VC TAKE VEHICLE W/O OWNER'S CONSENT/... | Rolando | 4500 Mataro DR | SAN DIEGO | CA | 92115.0 | 32.759579 | -117.055797 |
| 7 | 3022336_120 | 2022-09-13 01:27:00 | 3 | 211 PC ROBBERY (F) || | Ocean Beach | Santa Monica AVE & sunset Cliffs BLVD | SAN DIEGO | CA | 92107.0 | 32.745571 | -117.246865 |
| 9 | 3043773_13B_2 | 2022-11-19 12:30:27 | 7 | 242 PC SIMPLE BATTERY (M) || | East Village | 1500 Commercial ST | SAN DIEGO | CA | 92113.0 | 32.705280 | -117.150274 |
| 10 | 2999687_520 | 2022-06-29 20:27:00 | 4 | 29800 (A)(1) PC FELON/ADDICT/POSSESS/ETC FIREA... | Unknown | 8300 Verde Ridge RD | CN | CA | 91977.0 | 32.695627 | -117.021630 |
To keep only criminal activity within 0.25 km of a Blue Line Trolley stop, we did the following:
- Defined each Blue Line stop location by (latitude, longitude) using Google Maps.
- Computed the haversine distance from each crime to every trolley stop, since the haversine formula accounts for the Earth's curvature.
- Kept only crime reports within 0.25 km of a trolley stop, because reports farther away are likely irrelevant to our study.
- Categorized trolley stops by whether they belong to the La Jolla neighborhood. The only three trolley stops that we have defined to be in the La Jolla neighborhood are uc san diego health la jolla, uc san diego central campus, and nobel drive.
BL_STOP_LOCATIONS = {'utc': (32.87132037497218, -117.21143707227878),
'executive drive': (32.87420314516726, -117.21402421305726),
'uc san diego health la jolla': (32.88192985653163, -117.22354801580396),
'uc san diego central campus': (32.878489138561804, -117.23170614117566),
'va medical center': (32.874233328943845, -117.22976544174801),
'nobel drive': (32.867049300090656, -117.23037640267916),
'balboa avenue': (32.80580312583446, -117.21402421302811),
'clairemont drive': (32.79054726826523, -117.20604090586127),
'tecolote road': (32.77033205587625, -117.20503461726902),
'old town': (32.75395964857332, -117.19934982724807),
'washington street': (32.741968130063306, -117.18441623410449),
'middletown': (32.73420387391221, -117.17489221770487),
'county center/little italy': (32.72153187161157, -117.16981514496572),
'santa fe depot': (32.71746871384775, -117.16979527062813),
'america plaza': (32.71653232198834, -117.16894067410999),
'civic center': (32.71674969953285, -117.16228277100737),
'fifth avenue': (32.71693363396822, -117.15948048941088),
'city college': (32.71611428673004, -117.1540547952427),
'park & market': (32.71170310445124, -117.15373390406646),
'12th & imperial': (32.70635181567131, -117.15333641731496),
'barrio logan': (32.698458078880975, -117.1466586398805),
'harborside': (32.691583938485174, -117.13296522125287),
'pacific fleet': (32.68658271056329, -117.12477699415423),
'8th street': (32.67528877278743, -117.11310796123549),
'24th street': (32.66241271360724, -117.10798544931745),
'e street': (32.63949449945585, -117.09881022422739),
'h street': (32.63075771074484, -117.09526659182),
'palomar street': (32.603844319790944, -117.08529354675186),
'palm avenue': (32.58459821075564, -117.08400457618642),
'iris avenue': (32.569944065604936, -117.06690164944314),
'beyer blvd.': (32.5581046955884, -117.04697389590208),
'san ysidro': (32.54468994869954, -117.02957775934591)}
def find_min_haversine_distance_to_stop(crime_lat, crime_long, bl_stop_locations):
    """Return the distance (km) to the closest Blue Line stop and that stop's name."""
    RADIUS_EARTH_KM = 6371.0
    STOP_NAMES = list(bl_stop_locations.keys())
    STOP_LOCATIONS = list(bl_stop_locations.values())
    distance_from_stops_km = []
    for stop_location in STOP_LOCATIONS:
        stop_lat, stop_long = stop_location[0], stop_location[1]
        # The trigonometric functions expect radians, not degrees
        crime_lat_radians, crime_long_radians, stop_lat_radians, stop_long_radians = map(radians, [crime_lat, crime_long, stop_lat, stop_long])
        # Haversine formula
        dlat = stop_lat_radians - crime_lat_radians
        dlon = stop_long_radians - crime_long_radians
        a = sin(dlat / 2)**2 + cos(crime_lat_radians) * cos(stop_lat_radians) * sin(dlon / 2)**2
        c = 2 * atan2(sqrt(a), sqrt(1 - a))
        distance_from_stops_km.append(RADIUS_EARTH_KM * c)
    # Pick the smallest distance and look up the stop it belongs to
    min_distance = min(distance_from_stops_km)
    closest_stop = STOP_NAMES[distance_from_stops_km.index(min_distance)]
    return pd.Series([min_distance, closest_stop])
crimes_report[['min_distance', 'closest_stop']] = crimes_report.apply(lambda row: find_min_haversine_distance_to_stop(row['latitude'], row['longitude'],
BL_STOP_LOCATIONS), axis=1)
crimes_report = crimes_report[crimes_report['min_distance'] <= 0.25]
LA_JOLLA_STOPS = ['uc san diego health la jolla', 'uc san diego central campus','nobel drive']
crimes_report['la_jolla'] = crimes_report['closest_stop'].apply(lambda x: x in LA_JOLLA_STOPS)
crimes_report.to_csv('cleaned_sd_crimes_report.csv', index=False)
crimes_report.head()
| | nibrs_uniq | occurred_on | day_of_week | code_section | neighborhood | block_addr | city | state | zip | latitude | longitude | min_distance | closest_stop | la_jolla |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 3044145_13A_2 | 2022-11-23 13:43:45 | 4 | 245 (A)(1) PC ASSAULT W/DEADLY WEAPON:NOT F/AR... | East Village | 14th ST & national AVE | SAN DIEGO | CA | 92113.0 | 32.705407 | -117.151901 | 0.170503 | 12th & imperial | False |
| 106 | 3054619_200 | 2022-12-27 18:44:00 | 3 | 451 (D) PC ARSON:PROP (F) || | Core-Columbia | 900 10th AVE | SAN DIEGO | CA | 92101.0 | 32.714746 | -117.155664 | 0.214081 | city college | False |
| 173 | 3099629_90Z | 2023-05-23 17:26:00 | 3 | 5150 WI MENTAL DISORDER 72 HR OBSERVATION || | Petco Park | 1200 Imperial AVE | SAN DIEGO | CA | 92101.0 | 32.706329 | -117.153635 | 0.028090 | 12th & imperial | False |
| 211 | 3007831_13A_2 | 2022-07-27 10:00:39 | 4 | 245 (A)(4) PC ADW WITH FORCE:POSSIBLE GBI (M) || | East Village | 100 14th ST | SAN DIEGO | CA | 92101.0 | 32.706341 | -117.151950 | 0.129742 | 12th & imperial | False |
| 258 | 3061756_90E | 2023-01-22 20:55:00 | 1 | 647 (F) PC DRUNK IN PUBLIC: ALCOHOL, DRUGS, CO... | Midtown | 1700 Hancock ST | SAN DIEGO | CA | 92101.0 | 32.741995 | -117.183137 | 0.119652 | washington street | False |
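The row-by-row distance computation above can be slow on a few hundred thousand reports. As an alternative, here is a minimal vectorized sketch using NumPy broadcasting; it is not the code we ran. The helper name `haversine_km` is hypothetical, the two stops are copied from `BL_STOP_LOCATIONS`, the first crime coordinate is taken from the last row of the table above, and the second is made up.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are in degrees and broadcast like arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two stops copied from BL_STOP_LOCATIONS.
stop_names = np.array(["old town", "washington street"])
stop_lat = np.array([32.75395964857332, 32.741968130063306])
stop_lon = np.array([-117.19934982724807, -117.18441623410449])

# First point is from the table above; the second is made up.
crime_lat = np.array([32.741995, 32.7540])
crime_lon = np.array([-117.183137, -117.1990])

# Shape (n_crimes, n_stops): every crime against every stop in one call.
distances = haversine_km(crime_lat[:, None], crime_lon[:, None],
                         stop_lat[None, :], stop_lon[None, :])
closest_stop = stop_names[distances.argmin(axis=1)]
min_distance = distances.min(axis=1)
print(closest_stop, min_distance)
```

The resulting `min_distance` and `closest_stop` arrays play the same role as the two columns produced by `find_min_haversine_distance_to_stop`.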
Dataset #2 Zillow Home Values¶
This dataset is obtained from Zillow. It provides data about the typical home value across a given region for homes in the 35th to 65th percentile range—the Zillow Home Value Index (ZHVI). The prices date back to 2000-01-31, with the ZHVI being recorded at the end of every month. Some key variables we are interested in from this dataframe include RegionName, RegionType, State, Metro, CountyName, and dates starting from October 31st, 2021.
home_values_fp = 'zillow_home_values.csv'
zillow_home_values = pd.read_csv(home_values_fp)
zillow_home_values.head()
| | RegionID | SizeRank | RegionName | RegionType | StateName | State | City | Metro | CountyName | 2000-01-31 | ... | 2023-06-30 | 2023-07-31 | 2023-08-31 | 2023-09-30 | 2023-10-31 | 2023-11-30 | 2023-12-31 | 2024-01-31 | 2024-02-29 | 2024-03-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 112345 | 0 | Maryvale | neighborhood | AZ | AZ | Phoenix | Phoenix-Mesa-Chandler, AZ | Maricopa County | 68055.916438 | ... | 3.226864e+05 | 3.251916e+05 | 3.282309e+05 | 3.309207e+05 | 3.333738e+05 | 3.354533e+05 | 3.370457e+05 | 3.380757e+05 | 3.390568e+05 | 3.405442e+05 |
| 1 | 192689 | 1 | Paradise | neighborhood | NV | NV | Las Vegas | Las Vegas-Henderson-Paradise, NV | Clark County | 135267.765998 | ... | 3.658649e+05 | 3.676967e+05 | 3.706294e+05 | 3.735335e+05 | 3.760529e+05 | 3.782375e+05 | 3.803553e+05 | 3.822835e+05 | 3.841350e+05 | 3.859885e+05 |
| 2 | 270958 | 2 | Upper West Side | neighborhood | NY | NY | New York | New York-Newark-Jersey City, NY-NJ-PA | New York County | 411066.239358 | ... | 1.341331e+06 | 1.334758e+06 | 1.324559e+06 | 1.314097e+06 | 1.302547e+06 | 1.290178e+06 | 1.282332e+06 | 1.276492e+06 | 1.270069e+06 | 1.264422e+06 |
| 3 | 270957 | 3 | Upper East Side | neighborhood | NY | NY | New York | New York-Newark-Jersey City, NY-NJ-PA | New York County | 659533.234940 | ... | 1.294462e+06 | 1.289979e+06 | 1.285379e+06 | 1.280716e+06 | 1.272250e+06 | 1.260767e+06 | 1.250209e+06 | 1.243174e+06 | 1.236986e+06 | 1.232940e+06 |
| 4 | 118208 | 4 | South Los Angeles | neighborhood | CA | CA | Los Angeles | Los Angeles-Long Beach-Anaheim, CA | Los Angeles County | 134786.397244 | ... | 6.582787e+05 | 6.658748e+05 | 6.760560e+05 | 6.863625e+05 | 6.951129e+05 | 7.019075e+05 | 7.063378e+05 | 7.039896e+05 | 6.972987e+05 | 6.903372e+05 |
5 rows × 300 columns
1. How was sd_home_values created?¶
- Filtered the dataset to include only San Diego County properties.
- Filtered the dataset to include prices from one month before 2021-11-30 (the month-end record for the month in which the Blue Line trolley stop at the UCSD campus opened) to the most recent month.
- Dropped unnecessary columns and converted the result to CSV.
sd_home_values = zillow_home_values[zillow_home_values['CountyName'] == 'San Diego County']
num_cols = len(sd_home_values.columns)
sd_home_values = sd_home_values.iloc[:, list(range(9)) + list(range(num_cols - 30, num_cols))]
cols_drop = ['SizeRank', 'StateName']
sd_home_values = sd_home_values.drop(columns=cols_drop)
sd_home_values.to_csv('cleaned_sd_home_values.csv', index=False)
sd_home_values.head()
| | RegionID | RegionName | RegionType | State | City | Metro | CountyName | 2021-10-31 | 2021-11-30 | 2021-12-31 | ... | 2023-06-30 | 2023-07-31 | 2023-08-31 | 2023-09-30 | 2023-10-31 | 2023-11-30 | 2023-12-31 | 2024-01-31 | 2024-02-29 | 2024-03-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89 | 116625 | Mira Mesa | neighborhood | CA | San Diego | San Diego-Chula Vista-Carlsbad, CA | San Diego County | 7.909030e+05 | 7.947749e+05 | 8.029774e+05 | ... | 9.108561e+05 | 9.259285e+05 | 9.430418e+05 | 9.603600e+05 | 9.743040e+05 | 9.834509e+05 | 9.901680e+05 | 9.967258e+05 | 1.006417e+06 | 1.019400e+06 |
| 95 | 343228 | Southwest | neighborhood | CA | Chula Vista | San Diego-Chula Vista-Carlsbad, CA | San Diego County | 6.130249e+05 | 6.137920e+05 | 6.171048e+05 | ... | 6.934955e+05 | 7.034853e+05 | 7.139887e+05 | 7.236314e+05 | 7.319901e+05 | 7.382049e+05 | 7.430840e+05 | 7.467625e+05 | 7.509317e+05 | 7.570842e+05 |
| 172 | 343244 | Northwest | neighborhood | CA | Chula Vista | San Diego-Chula Vista-Carlsbad, CA | San Diego County | 6.452574e+05 | 6.464710e+05 | 6.507760e+05 | ... | 7.195701e+05 | 7.287672e+05 | 7.389097e+05 | 7.481860e+05 | 7.563439e+05 | 7.625271e+05 | 7.674295e+05 | 7.712530e+05 | 7.757498e+05 | 7.824709e+05 |
| 182 | 117557 | Rancho Penasquitos | neighborhood | CA | San Diego | San Diego-Chula Vista-Carlsbad, CA | San Diego County | 1.066124e+06 | 1.077560e+06 | 1.095914e+06 | ... | 1.260169e+06 | 1.280856e+06 | 1.303991e+06 | 1.328596e+06 | 1.349857e+06 | 1.364452e+06 | 1.371689e+06 | 1.375539e+06 | 1.385696e+06 | 1.405995e+06 |
| 197 | 273140 | Carmel Valley | neighborhood | CA | San Diego | San Diego-Chula Vista-Carlsbad, CA | San Diego County | 1.475926e+06 | 1.498744e+06 | 1.530745e+06 | ... | 1.767476e+06 | 1.793098e+06 | 1.821107e+06 | 1.846432e+06 | 1.869365e+06 | 1.886040e+06 | 1.895003e+06 | 1.901779e+06 | 1.916004e+06 | 1.945367e+06 |
5 rows × 37 columns
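The positional selection above (the last 30 columns) depends on how many months the Zillow file contains when it is downloaded. A label-based alternative, sketched here as an assumption rather than the code we ran, selects the date columns by parsing their labels and comparing them to a cutoff. The miniature table `mini_zillow` and its values are for illustration only.

```python
import pandas as pd

# Illustrative miniature of the Zillow table: ID columns plus month-end price columns.
mini_zillow = pd.DataFrame({
    "RegionName": ["Mira Mesa", "Carmel Valley"],
    "CountyName": ["San Diego County", "San Diego County"],
    "2021-09-30": [780000.0, 1450000.0],
    "2021-10-31": [790903.0, 1475926.0],
    "2021-11-30": [794774.9, 1498744.0],
})

id_cols = ["RegionName", "CountyName"]
# Labels that do not match the date format become NaT and compare as False.
parsed = pd.to_datetime(mini_zillow.columns, format="%Y-%m-%d", errors="coerce")
keep_dates = mini_zillow.columns[parsed >= pd.Timestamp("2021-10-31")]
subset = mini_zillow[id_cols + list(keep_dates)]
print(subset.columns.tolist())
```

This keeps the same date window even if a newer download adds more monthly columns.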
Using Neighborhoods to Study La Jolla versus Non-La Jolla Trolley Stops¶
Since we do not have the latitude and longitude of the property values, we can make the most of our dataset by associating each trolley stop's crimes with a particular neighborhood. One way to do this is outlined below:
2. How was the crimes_and_homes created?¶
- Normalized names for neighborhoods.
- Merged the datasets crimes_report and sd_home_values, then dropped rows without a matching neighborhood (null RegionName).
- Found the average home value by neighborhood and stored it in the column avg_home_value.
crimes_report['neighborhood'] = (crimes_report['neighborhood'].str.upper()
.replace('JAMACHA-LOMITA', 'JOMACHA-LOMITA')
.replace('KEARNEY MESA', 'KEARNY MESA')
.replace('GASLAMP', 'GASLAMP QUARTER')
.replace('TORREY PINES', 'LA JOLLA')
.replace('TIERRA SANTA', 'TIERRASANTA')
.replace('MIRAMAR', 'MIRAMAR RANCH NORTH')
.replace('MISSION BAY PARK', 'BAY PARK')
.replace('BORDER', 'SAN YSIDRO')
.replace('PETCO PARK', 'EAST VILLAGE')
)
sd_home_values['RegionName'] = sd_home_values['RegionName'].str.upper()
crimes_and_homes = crimes_report.merge(sd_home_values, left_on='neighborhood', right_on='RegionName', how='left')
crimes_and_homes = crimes_and_homes.dropna(subset=['RegionName'])
crimes_and_homes['avg_home_value'] = crimes_and_homes.iloc[:, 21:].mean(axis=1)
crimes_and_homes.head()
| | nibrs_uniq | occurred_on | day_of_week | code_section | neighborhood | block_addr | city | state | zip | latitude | ... | 2023-07-31 | 2023-08-31 | 2023-09-30 | 2023-10-31 | 2023-11-30 | 2023-12-31 | 2024-01-31 | 2024-02-29 | 2024-03-31 | avg_home_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3044145_13A_2 | 2022-11-23 13:43:45 | 4 | 245 (A)(1) PC ASSAULT W/DEADLY WEAPON:NOT F/AR... | EAST VILLAGE | 14th ST & national AVE | SAN DIEGO | CA | 92113.0 | 32.705407 | ... | 6.672742e+05 | 6.704466e+05 | 6.738262e+05 | 6.772098e+05 | 6.798052e+05 | 6.796078e+05 | 6.784956e+05 | 6.778140e+05 | 6.789785e+05 | 6.617293e+05 |
| 1 | 3054619_200 | 2022-12-27 18:44:00 | 3 | 451 (D) PC ARSON:PROP (F) || | CORE-COLUMBIA | 900 10th AVE | SAN DIEGO | CA | 92101.0 | 32.714746 | ... | 1.117251e+06 | 1.124233e+06 | 1.131977e+06 | 1.138689e+06 | 1.143355e+06 | 1.141057e+06 | 1.136655e+06 | 1.133561e+06 | 1.136818e+06 | 1.106129e+06 |
| 2 | 3099629_90Z | 2023-05-23 17:26:00 | 3 | 5150 WI MENTAL DISORDER 72 HR OBSERVATION || | EAST VILLAGE | 1200 Imperial AVE | SAN DIEGO | CA | 92101.0 | 32.706329 | ... | 6.672742e+05 | 6.704466e+05 | 6.738262e+05 | 6.772098e+05 | 6.798052e+05 | 6.796078e+05 | 6.784956e+05 | 6.778140e+05 | 6.789785e+05 | 6.617293e+05 |
| 3 | 3007831_13A_2 | 2022-07-27 10:00:39 | 4 | 245 (A)(4) PC ADW WITH FORCE:POSSIBLE GBI (M) || | EAST VILLAGE | 100 14th ST | SAN DIEGO | CA | 92101.0 | 32.706341 | ... | 6.672742e+05 | 6.704466e+05 | 6.738262e+05 | 6.772098e+05 | 6.798052e+05 | 6.796078e+05 | 6.784956e+05 | 6.778140e+05 | 6.789785e+05 | 6.617293e+05 |
| 4 | 3061756_90E | 2023-01-22 20:55:00 | 1 | 647 (F) PC DRUNK IN PUBLIC: ALCOHOL, DRUGS, CO... | MIDTOWN | 1700 Hancock ST | SAN DIEGO | CA | 92101.0 | 32.741995 | ... | 1.244183e+06 | 1.262817e+06 | 1.283069e+06 | 1.301506e+06 | 1.314964e+06 | 1.319847e+06 | 1.319815e+06 | 1.321694e+06 | 1.330795e+06 | 1.227244e+06 |
5 rows × 52 columns
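Before dropping the unmatched rows, it can help to see which neighborhood names failed to find a home-value record. The following is a minimal sketch, assuming toy stand-ins (`crimes_toy`, `homes_toy`, both hypothetical, with made-up values) for crimes_report and sd_home_values; it uses the `indicator=True` option of `merge` to flag unmatched rows.

```python
import pandas as pd

# Hypothetical toy frames standing in for crimes_report and sd_home_values
crimes_toy = pd.DataFrame({"neighborhood": ["EAST VILLAGE", "MIDTOWN", "PETCO PARK"]})
homes_toy = pd.DataFrame({"RegionName": ["EAST VILLAGE", "MIDTOWN"],
                          "2024-03-31": [650000.0, 1300000.0]})  # made-up values

# indicator=True adds a _merge column recording whether each row found a match
merged = crimes_toy.merge(homes_toy, left_on="neighborhood",
                          right_on="RegionName", how="left", indicator=True)

# Neighborhood names with no home-value match; these rows would be dropped
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "neighborhood"].unique())
print(unmatched)
```

Names that appear in `unmatched` are candidates for another entry in the `.replace(...)` chain above.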
sd_police_stations = pd.read_csv("sd_police_stations.csv")
sd_police_stations.head()
| | type | name | address | city | state | zipcode | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | San Diego Police Department | Northwestern Division | 12592 El Camino Real | San Diego | CA | 92130 | 32.947510 | -117.237892 |
| 1 | San Diego Police Department | Northeastern Division | 13396 Salmon River Rd | San Diego | CA | 92129 | 32.959229 | -117.122940 |
| 2 | San Diego Police Department | Northern Division | 4275 Eastgate Mall | San Diego | CA | 92037 | 32.876793 | -117.215607 |
| 3 | San Diego Police Department | Western Division | 5215 Gaines St | San Diego | CA | 92110 | 32.764118 | -117.194527 |
| 4 | San Diego Police Department | Eastern Division | 9225 Aero Dr | San Diego | CA | 92123 | 32.809448 | -117.129036 |
%matplotlib inline
sns.set_theme(context='notebook',
style='white',
font_scale=1.5)
Missing Values¶
- How can we check missing values for each dataset: cleaned_sd_crimes_report.csv, cleaned_sd_home_values.csv, and sd_police_stations.csv?
crimes_and_homes.isnull().sum()
nibrs_uniq 0 occurred_on 0 day_of_week 0 code_section 0 neighborhood 0 block_addr 516 city 0 state 0 zip 11 latitude 0 longitude 0 min_distance 0 closest_stop 0 la_jolla 0 RegionID 0 RegionName 0 RegionType 0 State 0 City 0 Metro 0 CountyName 0 2021-10-31 0 2021-11-30 0 2021-12-31 0 2022-01-31 0 2022-02-28 0 2022-03-31 0 2022-04-30 0 2022-05-31 0 2022-06-30 0 2022-07-31 0 2022-08-31 0 2022-09-30 0 2022-10-31 0 2022-11-30 0 2022-12-31 0 2023-01-31 0 2023-02-28 0 2023-03-31 0 2023-04-30 0 2023-05-31 0 2023-06-30 0 2023-07-31 0 2023-08-31 0 2023-09-30 0 2023-10-31 0 2023-11-30 0 2023-12-31 0 2024-01-31 0 2024-02-29 0 2024-03-31 0 avg_home_value 0 dtype: int64
sd_police_stations.isnull().sum()
type 0 name 0 address 0 city 0 state 0 zipcode 0 latitude 0 longitude 0 dtype: int64
- Why are we not going to drop missing values from crimes_and_homes?
As previously stated, we merge our datasets by neighborhood, and the only missing values are in the block_addr and zip columns. Since the neighborhood, longitude, and latitude columns are fully populated, these observations are still helpful for our analysis.
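Because the merged frame has more than fifty columns, the full `isnull().sum()` output is long. A minimal sketch of narrowing it to only the columns that contain gaps, using a hypothetical toy frame (`toy`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only block_addr and zip contain gaps
toy = pd.DataFrame({
    "block_addr": ["100 14th ST", None, "900 10th AVE"],
    "zip": [92101.0, 92113.0, np.nan],
    "latitude": [32.70, 32.71, 32.72],
})

missing = toy.isnull().sum()
# Keep only the columns that actually have missing entries
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)
```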
Visualizing the Datasets¶
In the map below, the longitude and latitude coordinates of all crime incidents from the crimes_report dataset are plotted in red. Multiple clusters can be observed on the map, with each cluster formed by crimes within 0.25 km of a trolley stop. Additionally, the coordinates for police stations are plotted in blue on the same map. This provides a helpful visualization to understand the density of crime incidents relative to police station proximity.
crime_locations = [Point(xy) for xy in zip(crimes_report['longitude'], crimes_report['latitude'])]
police_stations = [Point(xy) for xy in zip(sd_police_stations['longitude'], sd_police_stations['latitude'])]
gpd_crime = gpd.GeoDataFrame(crimes_report, geometry=crime_locations)
gpd_police = gpd.GeoDataFrame(sd_police_stations, geometry=police_stations)
sd_boundaries_fp = 'san_diego_boundary_datasd.geojson'
sd_county = gpd.read_file(sd_boundaries_fp)
ax = sd_county.plot(figsize=(10, 6), color='white', edgecolor='black')
gpd_crime.plot(ax=ax, marker='o', color='red', markersize=12, label='crime locations')
gpd_police.plot(ax=ax,marker='o', color='blue', markersize=12, label='police stations')
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Crime and Police Station Locations in San Diego")
plt.legend()
plt.show()
Exploring Relationships Between Rates of Crime Activity at Each Trolley Stop¶
To uncover potential correlations between crime rates and both property values and police station proximity, it is important to first understand the distribution of crime rates at each trolley stop. The horizontal bar chart below illustrates this distribution. For each trolley stop on the Blue Line, the chart shows the corresponding number of criminal incidents.
plt.figure(figsize=(8, 7))
crimes_per_stop = ((crimes_and_homes.groupby(['closest_stop', 'la_jolla'])['min_distance'].count()
.sort_values())
.reset_index()
.rename(columns={'min_distance': 'crime_counts'}))
sns.barplot(x=crimes_per_stop['crime_counts'], y=crimes_per_stop['closest_stop'])
plt.title('Number of Criminal Incidents Per Trolley Stop')
plt.xlabel('Crime Incident Counts')
plt.ylabel('Trolley Stop')
Text(0, 0.5, 'Trolley Stop')
The horizontal bar plot above is useful for visualizing the distribution of crime activity at each individual trolley stop. However, because our analysis specifically focuses on the disparities between trolley stops within and outside of La Jolla, we additionally highlight these distinctions in the bar chart below. Grouping trolley stops into the categories 'True' and 'False' for whether they fall within La Jolla, the bar chart below depicts the average number of criminal incidents per stop within each category. The visualization reveals a large difference in average criminal activity.
# count the number of La Jolla and non-La Jolla stops and store in a series
n_lj_stops = crimes_and_homes[crimes_and_homes['la_jolla'] == True]['closest_stop'].nunique()
n_non_lj_stops = crimes_and_homes[crimes_and_homes['la_jolla'] == False]['closest_stop'].nunique()
num_stops = pd.Series([n_lj_stops, n_non_lj_stops], index=[True, False])
# the plot below compares the average number of criminal incidents per stop for La Jolla vs. non-La Jolla stops on the Blue Line
avg_crimes = crimes_and_homes.groupby('la_jolla')['min_distance'].count() / num_stops
plt.figure(figsize=(6, 4))
sns.barplot(x=avg_crimes.index, y=avg_crimes.values)
plt.title('Average # of Criminal Incidents Per Trolley Stop For La Jolla vs. Non-La Jolla Trolley Stops', fontsize=12)
plt.xlabel('Is La Jolla?')
plt.ylabel('Crime Incident Counts')
Text(0, 0.5, 'Crime Incident Counts')
After understanding the distribution of criminal activity for both individual trolley stops and La Jolla vs. non-La Jolla stops, we want to explore relationships between variables that may correlate with these patterns. Specifically for our analysis, we wish to examine how factors such as property values and proximity to police stations correlate with the observed crime rates.
For property value, we used our Zillow dataset to determine the average home values for neighborhoods surrounding trolley stops both within and outside of La Jolla. The relationship between the number of crimes and the average home value for each trolley stop is depicted in the scatterplot below. To differentiate between La Jolla and non-La Jolla stops, the stops are marked with different colors indicating their respective categories. From this plot, we can observe a potential relationship between property value and crime, although the relationship does not appear to be strong.
# find the average home value surrounding each trolley stop
avg_home_value_per_stop = crimes_and_homes.groupby('closest_stop')['avg_home_value'].mean()
avg_home_and_crimes = pd.DataFrame({'avg_home_value_per_stop': avg_home_value_per_stop,
'crimes_per_stop': crimes_per_stop.set_index('closest_stop')['crime_counts']})
la_jolla_home_and_crimes = avg_home_and_crimes.loc[['uc san diego health la jolla', 'nobel drive'], :]
# the scatterplot below shows the relationship between the # of crimes and the average home values for each trolley stop on the Blue Line
# La Jolla trolley stops are highlighted in purple
plt.figure(figsize=(8, 7))
plt.scatter(avg_home_and_crimes['crimes_per_stop'], avg_home_and_crimes['avg_home_value_per_stop'], label='Non-La Jolla stops')
plt.scatter(la_jolla_home_and_crimes['crimes_per_stop'], la_jolla_home_and_crimes['avg_home_value_per_stop'], color='purple', label='La Jolla Stops')
plt.xlabel('number of crimes')
plt.ylabel('average home value')
plt.title('Number of Crimes and Average Home Values For Each Trolley Stop', fontsize=14)
plt.legend(fontsize=12)
plt.show()
Similarly, we generated a scatter plot to illustrate the relationship between police station proximity and crime activity. The plot shows the number of crimes at each stop against the distance to the closest police station in kilometers. La Jolla and non-La Jolla stops are differentiated for our analysis, and the plot does not show a clear relationship.
# coordinates of every police station as (latitude, longitude) pairs
police_station_coordinates = list(zip(sd_police_stations['latitude'], sd_police_stations['longitude']))
# this function finds the distance (km) from a point to the closest police station
def min_distance_to_police(lat1, lon1):
    R = 6371.0  # radius of the Earth in kilometers
    distances = []
    for lat2, lon2 in police_station_coordinates:
        # convert latitude and longitude from degrees to radians
        lat_radians1, lon_radians1, lat_radians2, lon_radians2 = map(radians, [lat1, lon1, lat2, lon2])
        # Haversine formula
        dlat = lat_radians2 - lat_radians1
        dlon = lon_radians2 - lon_radians1
        a = sin(dlat / 2)**2 + cos(lat_radians1) * cos(lat_radians2) * sin(dlon / 2)**2
        c = 2 * atan2(sqrt(a), sqrt(1 - a))
        distances.append(R * c)  # distance in kilometers
    return min(distances)
# a series of each trolley stop on the Blue Line and the distance to the closest police station in kilometers
dist_to_police = pd.Series({stop: min_distance_to_police(location[0], location[1]) for stop, location in BL_STOP_LOCATIONS.items()})
# a dataframe containing the trolley stop, the number of crimes recorded, and closest distance to a police station
police_and_crime = (pd.DataFrame(dist_to_police).merge(pd.DataFrame(crimes_per_stop.set_index('closest_stop')['crime_counts']), left_index=True, right_index=True, how='right'))
police_and_crime.columns = ['min_dist_to_police', 'num_crimes']
police_and_crime['la_jolla'] = police_and_crime.index.to_series().apply(lambda x: x in LA_JOLLA_STOPS)
plt.figure(figsize=(6, 4))
plt.scatter(police_and_crime['min_dist_to_police'], police_and_crime['num_crimes'], label='Non-La Jolla stops')
plt.scatter(police_and_crime[police_and_crime['la_jolla'] == True]['min_dist_to_police'],
police_and_crime[police_and_crime['la_jolla'] == True]['num_crimes'], label='La Jolla stops')
plt.xlabel('distance to police station (km)')
plt.ylabel('number of crimes')
plt.title('The Relationship Between Distance to Police Station and # of Crimes For Each Trolley Stop', fontsize=14)
plt.legend(fontsize=12)
<matplotlib.legend.Legend at 0x3068b0d50>
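To put a number on what the scatter plot suggests visually, one option is a correlation coefficient. The sketch below is illustrative only: `toy` is a hypothetical frame with made-up values standing in for police_and_crime, and the Spearman coefficient is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical per-stop values standing in for police_and_crime
toy = pd.DataFrame({
    "min_dist_to_police": [0.5, 1.2, 2.0, 3.1, 4.4],
    "num_crimes": [900, 120, 640, 80, 300],
})

# Pearson measures linear association between the raw values
pearson = toy["min_dist_to_police"].corr(toy["num_crimes"])

# Spearman is the Pearson correlation of the ranks, so it is less sensitive to outliers
ranks = toy.rank()
spearman = ranks["min_dist_to_police"].corr(ranks["num_crimes"])

print(round(pearson, 3), round(spearman, 3))
```

A coefficient close to zero on the real data would be consistent with the weak relationship described above, although with so few stops any estimate is noisy.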
Exploring Changes over Time¶
Another interesting relationship we want to examine is how property values change over time and whether this trend correlates with fluctuations in crime rates. The line chart below highlights this change; the monthly home-value columns in the merged data run from October 2021 through March 2024. La Jolla and non-La Jolla property values are differentiated by color, with La Jolla home values substantially higher. Also on the line chart is the average number of recorded crimes per stop per month surrounding La Jolla and non-La Jolla trolley stops. Non-La Jolla crime activity is observed to fluctuate more, but there does not appear to be any clear relationship between the two trends.
# home values over time for La Jolla and non-La Jolla stops
neighborhood_home_values = (crimes_and_homes[crimes_and_homes.columns[np.r_[13, 21:crimes_and_homes.shape[1]-1]]].groupby('la_jolla').mean()).T
neighborhood_home_values.index = pd.to_datetime(neighborhood_home_values.index)
# crime rates over time for La Jolla and non-La Jolla stops
lj_crimes_and_home = crimes_and_homes[crimes_and_homes['closest_stop'].isin(LA_JOLLA_STOPS)]
non_lj_crimes_and_home = crimes_and_homes[~crimes_and_homes['closest_stop'].isin(LA_JOLLA_STOPS)]
lj_monthly_crimes = lj_crimes_and_home.resample('M', on='occurred_on').count()
non_lj_monthly_crimes = non_lj_crimes_and_home.resample('M', on='occurred_on').count()
avg_non_lj_monthly_crimes = non_lj_monthly_crimes['city'] / n_non_lj_stops
avg_lj_monthly_crimes = lj_monthly_crimes['city'] / n_lj_stops
fig, ax1 = plt.subplots(figsize=(14, 8)) # figsize for declutter
# Plot home values over time for both categories (dashed); the labels feed the combined legend built below
sns.lineplot(x=neighborhood_home_values.index, y=neighborhood_home_values[True], ax=ax1,
             color='green', linestyle='--', label='La Jolla Home Values')
sns.lineplot(x=neighborhood_home_values.index, y=neighborhood_home_values[False], ax=ax1,
             color='#90EE90', linestyle='--', label='Non-La Jolla Home Values')
ax1.set_ylabel('Home Values (Millions)')
ax1.set_xlabel('Time')
# Plot crime rates over time for both categories (regular)
ax2 = ax1.twinx() # Create a second y-axis for the crime rates
crime_line_lj = sns.lineplot(x=avg_lj_monthly_crimes.index, y=avg_lj_monthly_crimes.values, ax=ax2,
color='red', label='La Jolla Crime Rates')
crime_line_nonlj = sns.lineplot(x=avg_non_lj_monthly_crimes.index, y=avg_non_lj_monthly_crimes.values, ax=ax2,
color='#FFB3B3', label='Non-La Jolla Crime Rates')
ax2.set_ylabel('Average Number of Incidents per Month')
# 4 line legend in the center
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
ax2.legend(lines_1 + lines_2, labels_1 + labels_2, loc='best', frameon=False)
ax1.get_legend().remove() # hide
plt.title('Change in Mean Housing Values and Crime Rates for La Jolla vs. Non-La Jolla Neighborhoods', fontsize=14)
plt.tight_layout()
plt.show()
Comparing the La Jolla and non-La Jolla regions over time, we see the following pattern: in the La Jolla region, higher average home values are paired with lower crime rates, while in the non-La Jolla region, lower average home values are paired with higher crime rates.
Understanding Significance: Permutation Test¶
While the figures above suggest a large difference in the frequency of criminal incidents at La Jolla and non-La Jolla trolley stops, conducting a statistical test provides a rigorous method to determine whether this observed difference is statistically significant. Here, we use a permutation test to examine whether the observed difference can be explained by random chance.
Our hypothesis is the same as previously defined:
Null hypothesis: There is no significant difference in the rates of criminal incidents between La Jolla trolley stops on the Blue Line and non-La Jolla trolley stops within San Diego.
Alternate hypothesis: La Jolla trolley stops on the Blue Line experience lower rates of criminal incidents compared to non-La Jolla trolley stops within San Diego.
The significance level is 5%.
The test statistic for this permutation test is the difference in average crime incidents per stop for La Jolla and non-La Jolla trolley stops. If La Jolla and non-La Jolla trolley stops were to come from the same distribution, it would mean that there is no actual difference in the rates of criminal incidents between them. Therefore, randomly shuffling the labels in the la_jolla column would result in test statistics centered around zero. However, if there is a significant difference in the rates of criminal activity, a permutation test would yield a statistically significant p-value, confirming the presence of a true difference between the two locations.
# calculate the observed statistic:
# avg crime per stop in La Jolla - # avg crime per stop outside La Jolla
avg_crimes = crimes_per_stop.groupby('la_jolla')['crime_counts'].mean()
obs_stat = avg_crimes[True] - avg_crimes[False]
print("If the observed stat is negative, then that means the average crime rates in La Jolla trolley stop is smaller than Non-La Jolla Trolley stops. obs_stat:", obs_stat)
# define a function to compute the test statistic
def diff_of_avg_crime(shuffled_df):
    shuffled_means = shuffled_df.groupby('la_jolla')['crime_counts'].mean()
    return shuffled_means[True] - shuffled_means[False]
# define a function to perform one shuffle of the permutation test
def simulate_stat(df):
    shuffled = df.copy()
    shuffled['la_jolla'] = np.random.permutation(shuffled['la_jolla'])
    return diff_of_avg_crime(shuffled)
# simulate 1000 samples
test_stats = [simulate_stat(crimes_per_stop) for i in range(1000)]
p_value = (pd.Series(test_stats) <= obs_stat).mean()
p_value
If the observed stat is negative, then that means the average crime rates in La Jolla trolley stop is smaller than Non-La Jolla Trolley stops. obs_stat: -546.38
0.015
After conducting the permutation test, we obtained a p-value that falls below our defined significance level of 0.05, indicating that there is a statistically significant difference in the rates of criminal incidents between La Jolla and non-La Jolla trolley stops on the Blue Line. Therefore, the difference in criminal incident rates is unlikely to be attributable to random chance alone. While our analysis observed a correlation between property values and crime rates surrounding trolley stops, further exploration must be done to determine if the relationship is causal. Additionally, this leaves room for future spatial-temporal analysis to discover other potential contributing factors.
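Because the labels are shuffled at random, the p-value changes slightly from run to run. A minimal sketch of a reproducible version of the same one-sided test follows; the counts and flags are hypothetical stand-ins for crimes_per_stop, and the seeded generator and helper name `diff_of_means` are assumptions of this sketch, not part of the analysis above.

```python
import numpy as np

# Hypothetical per-stop crime counts and La Jolla flags, standing in for crimes_per_stop
counts = np.array([12, 30, 410, 520, 615, 700, 480, 390])
is_lj = np.array([True, True, False, False, False, False, False, False])

def diff_of_means(values, labels):
    """Mean of the True group minus mean of the False group."""
    return values[labels].mean() - values[~labels].mean()

rng = np.random.default_rng(42)  # fixed seed so reruns give the same p-value
observed = diff_of_means(counts, is_lj)

n_reps = 1000
simulated = np.array([diff_of_means(counts, rng.permutation(is_lj))
                      for _ in range(n_reps)])

# One-sided p-value: share of shuffles at least as low as the observed statistic
p_value = (simulated <= observed).mean()
print(observed, p_value)
```

With only two La Jolla stops, the number of distinct label assignments is small, so the smallest achievable p-value is bounded by the share of shuffles that reproduce the most extreme assignment.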
Exploration of the Different Types of Crime and their Frequencies¶
Out of additional curiosity, we wanted to examine what types of crimes are the most common at each trolley stop. First, we generated a bar plot showing how often each crime type is the single most frequent crime at a trolley stop. This gives us some insight into the dominant safety concerns and patterns of criminal activity associated with trolley stops.
def find_top_crime(ser):
    return ser['code_section'].value_counts().index[0]
top_crimes_per_stop = (crimes_and_homes.groupby('closest_stop').apply(find_top_crime)
                       .reset_index()
                       .rename(columns={0: 'crime_type'}))
total_top_crimes = top_crimes_per_stop['crime_type'].value_counts()
plt.figure(figsize=(20, 10))
plt.barh(total_top_crimes.index, total_top_crimes.values)
plt.xlabel('Frequency as Top Crime Type')
plt.ylabel('Crime Type')
plt.title('Top Crime Types')
Text(0.5, 1.0, 'Top Crime Types')
From this plot, we can see that code section 5150 WI (Mental Disorder) is by far the most commonly observed crime type, being the top crime type at ten different trolley stops. Additionally, we can see that other prevalent crime types at trolley stops include 487 PC Grand Theft, 10851 VC Vehicle Theft, and 594 PC Vandalism.
code_section_counts = crimes_and_homes['code_section'].value_counts()
top_10_code_sections = code_section_counts.sort_values(ascending=False).head(10)
top_10_code_sections
plt.figure(figsize=(20, 10))
bars = plt.barh(top_10_code_sections.index, top_10_code_sections.values)
# horizontal bars: frequency runs along the x-axis, crime types along the y-axis
plt.xlabel('Frequency')
plt.ylabel('Crime Type')
plt.title('Top 10 Crime Types and Their Frequencies')
for bar in bars:
    plt.text(bar.get_width(), bar.get_y() + bar.get_height() / 2,
             f'{bar.get_width():.0f}', va='center', ha='left')
plt.show()
To get an understanding of the frequency of these crime types, we also plotted another bar chart to visualize the actual number of occurrences. With this graph, we can once again see that mental health related incidents dominate over others at the trolley stops, potentially suggesting a need for support services in these areas. As for other crime types, the distribution appears relatively consistent.
Geospatial Explorations¶
We obtained the dataset of neighborhoods of San Diego from David Blackman.
Seeing that neighborhoods and crime rates are correlated, we can explore the geospatial relationship between property value and police station proximity with respect to Blue Line trolley stops. The limitation of this data set is that it does not cover Chula Vista.
- Here is a visualization of the intricacies of neighborhoods.
gdf = gpd.read_file('san-diego.geojson')
fig, ax = plt.subplots(figsize=(10, 10))
gdf.plot(ax=ax, color='lightgrey', edgecolor='black')
plt.title('San Diego Neighborhoods')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
- Normalized the neighborhood names from the gdf and merged on neighborhood from crimes_and_homes.
- Plotted the geospatial relationship between neighborhood and crime rates.
- Plotted the geospatial relationship between average home value and crime rates.
gdf['name'] = gdf['name'].str.upper()
gdf = gdf.merge(crimes_and_homes, left_on="name", right_on="neighborhood", how="left")
gdf = gdf.dropna(subset=["nibrs_uniq"])
gdf.head(5)
| | name | cartodb_id | created_at | updated_at | geometry | nibrs_uniq | occurred_on | day_of_week | code_section | neighborhood | ... | 2023-07-31 | 2023-08-31 | 2023-09-30 | 2023-10-31 | 2023-11-30 | 2023-12-31 | 2024-01-31 | 2024-02-29 | 2024-03-31 | avg_home_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EGGER HIGHLANDS | 2 | 2013-02-16 03:11:02.653000+00:00 | 2013-02-16 03:11:19.974000+00:00 | MULTIPOLYGON (((-117.10971 32.60428, -117.1082... | 3181862_90Z | 2024-02-14 17:30:46 | 4.0 | BW-F ZZ FELONY BENCH WARRANT (OUR AGENCY) || O... | EGGER HIGHLANDS | ... | 7.298323e+05 | 7.407001e+05 | 7.505508e+05 | 7.584084e+05 | 7.638536e+05 | 7.677207e+05 | 7.713064e+05 | 7.763904e+05 | 7.833352e+05 | 7.193012e+05 |
| 2 | OLD TOWN | 3 | 2013-02-16 03:11:02.653000+00:00 | 2013-02-16 03:11:19.974000+00:00 | MULTIPOLYGON (((-117.19062 32.75900, -117.1898... | 3025745_90Z | 2022-09-24 20:30:14 | 7.0 | BW-F ZZ FELONY BENCH WARRANT (OUR AGENCY) || | OLD TOWN | ... | 1.026016e+06 | 1.036811e+06 | 1.047452e+06 | 1.054330e+06 | 1.055928e+06 | 1.053070e+06 | 1.050726e+06 | 1.052207e+06 | 1.059987e+06 | 1.031713e+06 |
| 3 | OLD TOWN | 3 | 2013-02-16 03:11:02.653000+00:00 | 2013-02-16 03:11:19.974000+00:00 | MULTIPOLYGON (((-117.19062 32.75900, -117.1898... | 3041619_90Z | 2022-11-14 23:29:00 | 2.0 | 5150 WI MENTAL DISORDER 72 HR OBSERVATION || | OLD TOWN | ... | 1.026016e+06 | 1.036811e+06 | 1.047452e+06 | 1.054330e+06 | 1.055928e+06 | 1.053070e+06 | 1.050726e+06 | 1.052207e+06 | 1.059987e+06 | 1.031713e+06 |
| 4 | OLD TOWN | 3 | 2013-02-16 03:11:02.653000+00:00 | 2013-02-16 03:11:19.974000+00:00 | MULTIPOLYGON (((-117.19062 32.75900, -117.1898... | 2998226_23H | 2022-06-25 04:00:00 | 7.0 | 487 (A) PC GRAND THEFT:MONEY/LABOR/PROPERTY (F... | OLD TOWN | ... | 1.026016e+06 | 1.036811e+06 | 1.047452e+06 | 1.054330e+06 | 1.055928e+06 | 1.053070e+06 | 1.050726e+06 | 1.052207e+06 | 1.059987e+06 | 1.031713e+06 |
| 5 | OLD TOWN | 3 | 2013-02-16 03:11:02.653000+00:00 | 2013-02-16 03:11:19.974000+00:00 | MULTIPOLYGON (((-117.19062 32.75900, -117.1898... | 3131896_290 | 2023-06-21 22:55:00 | 4.0 | 594 (B)(1) PC VANDALISM ($400 OR MORE) (F) || | OLD TOWN | ... | 1.026016e+06 | 1.036811e+06 | 1.047452e+06 | 1.054330e+06 | 1.055928e+06 | 1.053070e+06 | 1.050726e+06 | 1.052207e+06 | 1.059987e+06 | 1.031713e+06 |
5 rows × 57 columns
ax = gdf.plot(column="neighborhood", cmap="OrRd", edgecolor="k")
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Heatmap by Neighborhood')
plt.show()
ax = gdf.plot(column="avg_home_value", cmap="OrRd", edgecolor="k")
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Heatmap by Average Home Value')
plt.show()
There appears to be an inverse relationship in La Jolla when comparing the heatmap by neighborhood with the heatmap by average home value: the lighter color on the first map suggests that criminal incidents in the neighborhood are low, while the darker color on the second suggests that average home values are high.
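One caveat worth checking: as far as we understand geopandas' plotting behavior, a text column such as neighborhood is drawn categorically, so its shading follows the category order rather than the number of incidents. A minimal sketch of aggregating incidents per neighborhood, so that a numeric count can be mapped instead, is shown below. The frame `toy`, its values, and the column name `crime_count` are hypothetical.

```python
import pandas as pd

# Hypothetical incident-level rows standing in for crimes_and_homes
toy = pd.DataFrame({
    "neighborhood": ["EAST VILLAGE", "EAST VILLAGE", "EAST VILLAGE", "OLD TOWN", "LA JOLLA"],
    "avg_home_value": [660000.0, 660000.0, 660000.0, 1030000.0, 2000000.0],
})

# One row per neighborhood: number of incidents and the (constant within group) home value
per_neighborhood = (toy.groupby("neighborhood")
                       .agg(crime_count=("avg_home_value", "size"),
                            avg_home_value=("avg_home_value", "first"))
                       .reset_index())
print(per_neighborhood)
```

Merging a table like `per_neighborhood` onto the neighborhood polygons and plotting with `column="crime_count"` would then shade each neighborhood by its incident count.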
Ethics & Privacy¶
In obtaining the relevant data for our analysis, there are several important considerations regarding ethics and privacy. In general, most of our datasets (San Diego Metro Transit records, property values, education rankings) are publicly available records and we do not anticipate any privacy issues. However, biases may arise in other aspects of our data, including our population of interest: people who utilize the San Diego trolley. Several factors influence the frequency of trolley usage, from socioeconomic to demographic to geographic. These variances shape individual interactions with the trolley systems and may also impact the behaviors and experiences we observe, leading to bias. We aim to be transparent about this in our analysis. Another dataset that may raise a privacy concern is the San Diego NIBRS Police Report Data. This dataset contains specific information about crime offenses and addresses that could be considered sensitive. We plan to mitigate this issue by properly handling the data so that privacy is prioritized.
One main ethical concern for the findings of our project is the potential stigmatization of areas that may be revealed to experience higher rates of criminal activity. Because our project concerns communities and neighborhoods, we understand that results must be articulated carefully and responsibly. We will also maintain transparency of the analysis techniques and methodologies used to construct our results.
Discussion and Conclusion¶
Our analysis, using the datasets we collected, cleaned, and enriched, indicates that there is a significant difference in crime rates between La Jolla Blue Line trolley stops and non-La Jolla Blue Line trolley stops. With a defined significance level of 0.05, our permutation test produced a p-value that falls below the threshold. Therefore, we reject the null hypothesis in favor of the alternative, suggesting that the difference in criminal incidents at La Jolla and non-La Jolla trolley stops on the Blue Line cannot be explained by random chance alone. This aligns with the main hypothesis of our analysis.
While carrying out our analysis, we also performed exploratory data analysis to draw connections between crime rates and other characteristics. After getting a general overview of the location of crime incidents and police stations, we proceeded to analyze the crime rates at each stop, crime rates and home values over time, different crimes and their frequencies, and the geospatial relationship between crime incidents and property values.
The findings from our research reveal disparities in crime rates, with Non-La Jolla stops experiencing considerably higher crime compared to La Jolla stops. This study has potential implications for public safety, resource allocation, and community trust in public transportation. In addition to the disparity in crime rates, we found that the most prominent crimes are related to mental health issues among individuals using the public transportation system. This finding underscores the critical need for targeted interventions and enhanced mental health support to address this concern.
These insights call for change. Policymakers and urban planners can use this data to push for investments in safety measures, especially at non-La Jolla stops, by improving infrastructure, surveillance, and emergency response capabilities to help reduce crime rates and foster more secure transit. Making these changes and raising awareness could reduce stigma and promote vigilance on the city’s public transport, which will ultimately contribute to a more equitable and secure transit system.
Team Contributions¶
William Heng scouted for datasets, finding publicly available police records which were integral. He worked on enriching the datasets by adding geographical coordinates as well as graphical and descriptive analysis for crime patterns, for types of crimes, and mapping police stations. He also worked on the Abstract, Team video, Team Expectations and refined the Background sections.
Vuong Bui suggested the initial research question which was refined through team discussion and did research into prior related studies to develop the Background as well as Ethics and Concerns sections to contribute to our study’s credibility, reliability, and societal impacts. He also worked on describing datasets, creating graphical visualizations, along with creating and organizing the presentation slide-deck used for the team video.
Emily Chen initiated team meetings, recorded what to do for each member and as a group, collected the home value dataset with Sharon as well as the San Diego police station dataset. She also wrote background information about the dataset with Sharon and performed geospatial analysis on property value and crime incidents. Additionally, she helped merge everyone’s code in GitHub.
Sharon Ni performed the majority of the data cleaning and data wrangling to ensure the accuracy of datasets for analysis. She also carried out exploratory data analysis and generated graphs to illustrate trends in the dataset. She also conducted the permutation test and completed the Ethics and Privacy section of the write-up.
Rina Pecherskaya refined various aspects of our project, providing feedback to enhance its quality and coherence. She conducted a side analysis on demographic variables and crime rates, which informed our research phase, and played a crucial role in decluttering final plots and editing our presentation slides based on grader feedback, ensuring clear and effective communication.