May 18, 2020



This guide was created by the staff of the GIS/Data Center at Rice University and is to be used for individual educational purposes only. The steps outlined in this guide require access to ArcGIS Pro software and data that is available both online and at Fondren Library.

The following text styles are used throughout the guide:
  1. Explanatory text appears in a regular font.
  2. Instruction text is numbered.
  3. Required actions are underlined.
  4. Objects of the actions are in bold.
  5. Folder and file names are in italics.
  6. Names of Programs, Windows, Panes, Views, or Buttons are Capitalized.
  7. 'Names of windows or entry fields are in single quotation marks.'
  8. "Text to be typed appears in double quotation marks."

The following step-by-step instructions and screenshots are based on the Windows 10 operating system with the Windows Classic desktop theme and ArcGIS Pro 2.1.3 software. If your personal system configuration varies, you may experience minor differences from the instructions and screenshots.

Install the R-ArcGIS bridge

Scenario:

As a GIS analyst at the Houston Police Department, you're looking for a new way to gather insights and information about the spatial and temporal trends of criminal activity in the city. Using ArcGIS Pro, you'll find hotspots of crime and analyze demographic data. Then, you'll use the R-ArcGIS bridge to transfer the data to R to test the department's theories about the factors that affect unlawful behavior and the areas where higher numbers of crimes are likely to occur.

1. Download R and RStudio

First of all, you need to download and set up R and RStudio (RStudio is a free integrated development environment for R). For both, accept all defaults in the installation wizard.

  1. R link (R 3.2.2 or later is recommended): https://cran.mtu.edu/bin/windows/base/
  2. R Studio link: https://rstudio.com/products/rstudio/download/

2. Prepare data on Houston crime statistics

Though a web search for Houston crime GIS data turns up a shapefile (https://cohgis-mycity.opendata.arcgis.com/datasets/hpd-nibrs-crime), it contains only around 4,000 cases spread over multiple years, which is far from complete. It is also hard to see a space-time pattern when the cases fall on only a scattering of dates.

As a remedy, we will download the data from the Houston Crime Statistics page (https://www.houstontx.gov/police/cs/Monthly_Crime_Data_by_Street_and_Police_Beat.htm). For example, we can focus on the 2020 data through the end of May (see the following screenshot).

 

Note that the data is in .xlsx format. We need to save it as a .csv file and then geocode the addresses to turn it into a shapefile. Geocoding in ArcGIS Pro consumes service credits.
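Since the download is a spreadsheet, the preparation amounts to exporting to CSV and assembling a single address column for the geocoder. A rough illustration in plain Python (the field names here are hypothetical stand-ins, not the actual column names in the HPD spreadsheet):

```python
def build_address(row, city="Houston", state="TX"):
    """Assemble one geocodable address string from hypothetical street fields."""
    street = " ".join(
        p for p in (row["StreetNo"], row["StreetName"], row["StreetType"]) if p
    )
    return f"{street}, {city}, {state}"

row = {"StreetNo": "1200", "StreetName": "TRAVIS", "StreetType": "ST"}
print(build_address(row))  # 1200 TRAVIS ST, Houston, TX
```

A full, well-formed address string per row is what lets the ArcGIS World Geocoding Service place each case on the map.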

For convenience, I have already downloaded the data (and lightly cleaned the addresses) and geocoded it into a shapefile. Download the houston-crime-sample.rar archive and proceed.

Notice that, for simplicity, I use only the crime cases from the first 10 days of 2020, that is, January 1, 2020 through January 10, 2020. Ten days is the minimum because the Create Space Time Cube By Aggregating Points step in Part II requires at least 10 time step intervals (with the time step interval set to 1 day, that means at least 10 days of data).

3. Create a new project in ArcGIS Pro

  1. As mentioned above, download the houston-crime-sample.rar file.
  2. Locate the downloaded file on your computer and extract its contents to a folder named houston-crime-sample in a location of your choice.
  3. Open the houston-crime-sample folder and you will see houston-crime.gdb, the geodatabase which has crime data that you will add to a map.
  4. Start ArcGIS Pro. Under New, click Map. In the Create a New Project window, for Name, type Houston Crime Analysis. For Location, browse to and choose your houston-crime-sample folder. Uncheck Create a new folder for this project. Click OK, then the project is created.
  5. In the Catalog pane, on the Project tab, expand Folders and expand the houston-crime-sample folder. Expand the houston-crime.gdb geodatabase, right-click the Houston_Crimes_Sample feature class, and choose Add To Current Map.


As described above, this map shows locations where crimes occurred from January 1, 2020 through January 10, 2020 in the greater Houston area. This is only a small sample for crime analysis, but the general procedure is the same. You can apply this tutorial for larger data sets of crime statistics.

4. Install the R-ArcGIS bridge: automatic or manual method

The R-ArcGIS bridge is a tool for reading and writing data between ArcGIS Pro and R. Once it is installed, you can also run script tools that reference an R script. There are two ways to install the bridge: automatic and manual.

Automatic method:

On the ribbon, click the Project tab. Click Options.

In the Options pane, under Application, click Geoprocessing. In the R-ArcGIS Support section, select your desired R home directory. All versions of R installed on your computer will appear in the list. Select R 3.2.2 or a later version; in the example screenshot below, R 4.0.1 is used.

If you do not have the ArcGIS R integration package installed, next to Please install the ArcGIS R integration package, click Install package and choose Install package from the Internet. When asked to confirm the installation, click Yes, and when the installation is complete, click Close.

If you already have the ArcGIS R integration package installed, next to Installed 'arcgisbinding' package version, click the Check for updates button and choose Check package for updates to ensure that you have the latest version of the package.

In the Options window, click OK. Click the Back button to return to the map.


Manual method:

Error messages sometimes appear when installing the ArcGIS R integration package. In that case, use the manual method described below.

Go to https://github.com/R-ArcGIS/r-bridge-install (you can also refer to that repository's instructions for manual installation).

Download the repository r-bridge-install-master.zip by clicking Code and then Download ZIP.

Download the latest version of the arcgisbinding package from https://github.com/R-ArcGIS/r-bridge/releases/tag/v1.0.1.239. As of this writing (end of June 2020), that is arcgisbinding_1.0.1.239.zip.

Copy both zip files into a directory of your choice. Extract the r-bridge-install-master.zip file. Place arcgisbinding_1.0.1.239.zip in the same directory as the R Integration Python toolbox.

Back in ArcGIS Pro, in the Catalog pane, on the Project tab, right-click Toolboxes, choose Add Toolbox, and navigate to the location of the R Integration Python toolbox. Open the toolbox, which should look like the following screenshot:


Run the Install R bindings script. You can then test that the bridge is able to see your R installation by running the Print R Version and R Installation Details tools.

Basic statistical analysis

1. (Optional) Project the shapefile

It is optional because the shapefile I have prepared for you has already been projected. If you download the original data in .xlsx format and geocode addresses by yourself, you need to project the generated shapefile after geocoding addresses.

In the Geoprocessing pane, search Project in the Find Tools box and click Project to open the tool. Change the following parameters:

  • For Input Dataset or Feature Class, choose Houston_Crimes_Sample (or the output feature class name you chose when geocoding addresses).
  • For Output Dataset or Feature Class, use a name such as Houston_Crimes_Sample_Projected.
  • For Output Coordinate System, click the globe icon and choose Projected coordinate system > World > WGS 1984 World Mercator.

Click OK. It should look like the following:

Click Run in the lower-right corner to start projecting; the tool will generate the projected feature class.

2. Aggregate point data by counts within a defined location

This step helps us understand the data further. Before starting the analysis, you need to aggregate crime counts by space and time. Aggregation reveals the spatial and temporal relationships in your data that may not have been visible previously. Specifically, aggregating allows you to summarize your crime points in space-time bins that combine the crimes that have occurred into counts by space and time increments of your choice.

Open the Geoprocessing pane, search Create Space Time Cube in the Find Tools box and click Create Space Time Cube By Aggregating Points to open the tool. Change the following parameters:

  • For Input Features, choose Houston_Crimes_Sample.
  • For Output Space Time Cube, browse to your houston-crime-sample folder and name the output Houston_Crimes_Space_Time_Cube.nc.
  • For Time Field, choose Occurrence Date.
  • For Time Step Interval, type 1 and choose Days.
  • For Time Step Alignment, confirm that End time is chosen.
  • For Aggregation Shape Type, choose Hexagon grid.
  • For Distance Interval, type 1 and choose Miles.

It should look like the following screenshot:

These parameter values specify the size and shape of the space-time bins that you are creating. Because your data covers the first 10 days of 2020, analyzing crimes day by day is a natural choice. Additionally, your department wants to analyze crimes at a local level, so you select a small distance interval (here, 1 mile). Hexagon bins are selected because they are preferable in analyses that involve connectivity or movement paths.

Click Run. The Create Space Time Cube By Aggregating Points tool creates a netCDF file (.nc), which allows you to view spatial patterns and trends over time. The tool aggregated the 6,190 points in the Houston_Crimes_Sample layer into 762 hexagons (the polygon bins). The Distance Interval and Time Step Interval parameters impact the number of resulting bins and the size of each bin. These values can be chosen based on prior knowledge of the analysis area, or the tool will calculate values for you based on the spatial distribution of your data. You can confirm that this tool successfully created the file by checking the houston-crime-sample folder.
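Conceptually, the aggregation is a two-key group-and-count: one key for the spatial cell, one for the time step. A minimal sketch in plain Python, using a square grid for simplicity where the tool uses hexagons (the point coordinates below are made up):

```python
from collections import Counter
from datetime import date

def space_time_bins(points, cell_miles=1.0):
    """Count points per (grid cell, day) bin. x and y are offsets in miles
    from an arbitrary origin; a square-grid stand-in for hexagon bins."""
    counts = Counter()
    for x, y, day in points:
        cell = (int(x // cell_miles), int(y // cell_miles))
        counts[(cell, day)] += 1
    return counts

pts = [(0.2, 0.7, date(2020, 1, 1)),
       (0.9, 0.1, date(2020, 1, 1)),
       (1.5, 0.3, date(2020, 1, 2))]
bins = space_time_bins(pts)
print(bins[((0, 0), date(2020, 1, 1))])  # 2
```

Each (cell, day) count is exactly what ends up stored per bin in the netCDF cube as the COUNT variable.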

3. Analyze crime hot spots: One step forward

Next, you'll analyze where statistically significant clusters of crime are emerging and receding throughout the city. Your analysis will help the department anticipate problems and evaluate the effectiveness of resource allocation for their crime prevention measures.

In the Geoprocessing pane, click the Back button. Search for and open the Emerging Hot Spot Analysis tool. Change the following parameters:

  • For Input Space Time Cube, browse to and choose the Houston_Crimes_Space_Time_Cube.nc file.

  • For Analysis Variable, choose COUNT.

  • For Output Features, browse to your houston-crime-sample folder and name the output Houston_Crimes_Hot_Spots.shp.

It should look like the following screenshot:


By using the default value for Neighborhood Distance, you are letting the tool calculate a distance band for you based on the spatial distribution of your data. The Neighborhood Time Step value is set to one time step interval (one day in this case) by default. These settings are ideal for an exploratory analysis; however, if you knew the optimal distance band and time step interval for your analysis, you could set them.

Click Run. The tool runs and its results are added to the map. (A warning message informs you of the value that the tool used for the Neighborhood Distance parameter.)

In the Contents pane on the left, turn off the Houston_Crimes_Sample layer. You will see:


Trends in statistically significant hot and cold spots are shown on the map. Red areas indicate that over time there has been clustering of high numbers of crime, and blue areas indicate that over time there has been clustering of low numbers of crime. Each location is categorized based on the trends in clustering over time.

The dark red hexagon bins are persistent hot spots. These are locations that have been statistically significant hot spots for 90 percent of all of your time slices. However, these locations do not have a discernable increase or decrease in the intensity of clustering of crime counts over time.

In contrast, the light red with beige outlined hexagon bins are intensifying hot spots. These are locations that have been statistically significant hot spots for 90 percent of all of your time slices. In addition, these are locations where the intensity of clustering of crime counts is increasing over time, and that increase is statistically significant.

Conversely, the dark blue bins are persistent cold spots. These are areas where crime is statistically, and persistently, less prevalent. The light blue outlined bins are intensifying cold spots, the mirror image of intensifying hot spots: clusters of low crime counts in these cells are becoming more intense over time. In other words, the cold spots are getting colder.

The department needs to be especially concerned about the areas where crime is persistent or intensifying. They may move resources to these areas from the places where crime cold spots occur.

Save your project.

Till this step, you have installed the R-ArcGIS bridge, prepared your data for statistical analysis, and started using some of the available tools. Next, you'll add additional attributes to your dataset, allowing you to draw conclusions from your analysis about what factors likely influence the occurrence of crime.

Enhance the dataset with additional attributes

Previously, you installed the R-ArcGIS bridge and downloaded the data for your statistical analysis. Then, in ArcGIS, you aggregated your data based on areas and times of interest and began to explore temporal trends in your dataset. For the department to better understand what factors influence the prevalence of crime, you'll add additional information.

1. Add additional attributes to the original dataset

Now that you know where crime hot spots are emerging, you'll try to determine why they are emerging. In particular, you'll examine the relationship between an area's crime and its population. Statistical analysis can determine if the number of crimes occurring in a particular area is influenced by population. In addition, your department is interested in analyzing the presence of certain types of businesses, as well as the prevalence of parks, the amount of public land in a given area (hexagon bins), the median household income and home value, among other factors.

Currently, the hexagon bins in the space time cube layer contain no attribute information suitable for this kind of analysis. You'll run another geoprocessing tool to enrich the layer with relevant attribute information.

Open your Houston Crime Analysis project in ArcGIS Pro.

On the Analysis tab, in the Geoprocessing group, click Environments. In the Environments window, scroll down to the Fields section. Uncheck Maintain fully qualified field names.

Click OK.

Open the Geoprocessing pane, search for and open the Enrich tool. Notice that this step also requires ArcGIS service credits (approximately 50 at most). If you do not have sufficient credits, I have prepared the enriched feature class already, named Houston_Crimes_Sample_Enrich. To use it, in the Catalog pane, on the Project tab, expand Folders and expand the houston-crime-sample folder. Expand the houston-crime.gdb geodatabase and copy the Houston_Crimes_Sample_Enrich feature class to the default project geodatabase, Houston Crime Analysis.gdb.

If you want to enrich the feature class by yourself, change the following parameters:

  • For Input Features, choose Houston_Crimes_Hot_Spots.
  • For Output Feature Class, browse to the default project geodatabase Houston Crime Analysis.gdb and name the output feature class Houston_Crimes_Sample_Enrich.
  • For Variables, click the plus button. In the Add Variable window, search for and choose the following variables and click OK:
    • 2019 Total Population (Esri 2020)
    • 2020 Median Home Value
    • 2020 Median Household Income
    • 2020 Renter Occupied HUs
    • 2020 Food & Beverage Stores Bus (NAICS)
    • 2020 Food Service/Drinking Estab Bus (NAICS)

It should look like the following screenshot:

Notice that demographic data is updated periodically, so the available variables and values may differ from those specified in the lesson. If necessary, use the most recent data.

Moreover, the specific variables you add are important because their field names are used in the R script you'll run later in the lesson. If your variable names differ from those shown in the example image, you'll need to edit them when you paste the R script, or those lines won't run.

While not an exhaustive list of the variables that could potentially be linked to crime rates, this list will provide a good start for your analysis.

Click Run. The tool runs and the result layer is added to the map.

In the Contents pane, right-click the Houston_Crimes_Sample_Enrich layer and choose Attribute Table.

Scroll to the right in the attribute table until you can see the fields with which you chose to enrich the layer.

The newly added enrichment fields display in the table with alias names that are more descriptive than the original field names. In the list below, alias names are listed first, followed by the original field names in parentheses.

 The result of the Enrich tool includes the following fields and values:

  • HasData - Indicates whether the Enrich tool found data for the given hexagon bin, with 0 meaning a hexagon had no available data for all of the attributes you selected and 1 meaning a hexagon bin had data for at least one of the attributes you selected. You can use this field to filter your data so that only features with relevant attribute information appear on the map.
  • 2019 Total Population (Esri 2020) (historicalpopulation_tspop19_cy) - Contains the population count per hexagon bin. Some hexagons have a population of 0, for example because they cover an industrial area or a park. The first priority of your department is to reduce crimes in populated areas, so you'll focus only on populated locations.
  • 2020 Median Home Value (wealth_medval_cy) - Contains the median home value per hexagon bin.
  • 2020 Median Household Income (wealth_medhinc_cy) - Contains the median household income value per hexagon bin.
  • 2020 Renter Occupied HUs (ownerrenter_renter_cy) - Contains the number of renter occupied households per hexagon bin.
  • 2020 Food & Beverage Stores Bus (NAICS) (businesses_n13_bus) - Contains the count of food and beverage stores located within each hexagon bin.
  • 2020 Food Service/Drinking Estab Bus (NAICS) (businesses_n37_bus) - Contains the count of businesses that serve food, beverages, or both located within each hexagon bin.

You've created a feature class that contains the information needed to perform your analysis, but you now have some data that is not pertinent to your analysis goals. Hexagon bins that do not have information for your attributes of interest do not add any value or new information to help you answer your questions. Additionally, areas that are not populated are not of high priority for your department at this time. As a result, you'll need to trim down your enriched dataset to contain only the information most useful to you.

Close the Attribute Table of the Houston_Crimes_Sample_Enrich layer.

2. Further prepare the dataset

Next, you'll select the data that is relevant to your analysis and make a subset with only that information. This way, you still have access to all your enriched data should you need it for further analyses, but you can continue your current analysis with only the necessary data.

In the Geoprocessing pane, click the Back button. Search for and open the Select Layer By Attribute tool.

For Input Rows, choose Houston_Crimes_Sample_Enrich. For Selection type, confirm that New selection is chosen.

Click New Expression. Create the expression Where HasData is equal to 0.

Click Add Clause and add the expression Or 2019 Total Population (Esri 2020) is Equal to 0.


Click Run. The tool runs and selects features that have no enriched data, or that have zero population. These may be industrial sites or parks.
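The selection plus the later Switch step amount to partitioning the bins on one compound condition. A small Python sketch of that logic (the dictionary keys are simplified stand-ins for the actual attribute names):

```python
def split_by_relevance(rows):
    """Separate bins with no enriched data or zero population (the initial
    selection) from the bins kept after switching the selection."""
    dropped = [r for r in rows if r["HasData"] == 0 or r["Population"] == 0]
    kept = [r for r in rows if r["HasData"] == 1 and r["Population"] > 0]
    return dropped, kept

rows = [{"HasData": 1, "Population": 350},
        {"HasData": 1, "Population": 0},
        {"HasData": 0, "Population": 0}]
dropped, kept = split_by_relevance(rows)
print(len(dropped), len(kept))  # 2 1
```

In the actual data, the same partition splits the 762 hexagons into 18 dropped and 744 kept, as described below.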

Open the Attribute Table for the Houston_Crimes_Sample_Enrich layer. The table indicates that 18 of 762 rows are selected, meaning they have 0 values for the HasData or 2019 Total Population (Esri 2020) fields. You'll create a new dataset without these selected features so you can focus on features that have data relevant to your analysis.

In the attribute table, click the Switch button.

The button swaps the selection from the 18 rows that had no data or no population, to all of the other rows. You should have 744 of 762 rows selected, and you can now copy the enriched and populated data to its own layer.

In the Geoprocessing pane, click the Back button. Search for and open the Copy Features tool.

For Input Features, choose Houston_Crimes_Sample_Enrich. Notice that when you have specific rows selected, the Copy Features tool only copies those rows into your new feature class result.

For Output Feature Class, browse to Houston Crime Analysis.gdb and name the output feature class Houston_Crimes_Sample_Enrich_Subset.

Click Run. Now you have two layers, Houston_Crimes_Sample_Enrich and Houston_Crimes_Sample_Enrich_Subset. The former contains the full dataset, and the latter contains only the bins that have enriched attributes and people living in them.

Close the table. On the ribbon, on the Map tab, in the Selection group, click Clear.

Save your project.

Next, you'll learn how to analyze these attributes in R, and how they may influence the likelihood an area experiences crime.

Conduct statistical analysis using R and ArcGIS Pro

Previously, you enriched your data with additional attributes, including attributes about population. Next, you'll calculate the crime rate for each location on your map. A crime rate determines how many crimes occur relative to the population. This will allow you to better compare crime counts between areas with vastly different amounts of people, as well as determine how crime rate may be influenced by the other attributes you added to your data.

While you could use the attribute table's Field Calculator in ArcGIS to determine the number of crimes per 100,000 population, you want to ensure that the crime rates you calculate are statistically robust. You'll use functions in R to smooth your crime rate.
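For reference, the unsmoothed rate the Field Calculator would produce is just the count scaled to a standard population; the numbers below are illustrative only:

```python
def crude_rate_per_100k(crimes, population):
    """Crimes per 100,000 residents -- the raw, unsmoothed rate."""
    return crimes / population * 100_000

print(crude_rate_per_100k(5, 100_000))  # 5.0
```

The problem with this raw rate is that a bin with 2 people and 1 crime scores an enormous 50,000 per 100k, which is exactly the instability that smoothing addresses.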

For this analysis, you'll use the Empirical Bayes smoothing method. Empirical Bayes smoothing is a rate smoothing technique that uses the population in each of your bins as a measure of confidence in the data, with higher populations lending higher confidence. It then adjusts rates in areas with lower confidence toward the mean. This technique gives the crime rates stability.

1. Bridge your data into R

Next, you'll work in RStudio to perform Empirical Bayes smoothing on your crime rates. Because you have the R-ArcGIS bridge, the data in your ArcGIS Pro project is connected to and accessible from RStudio.

Open your Houston Crime Analysis project in ArcGIS Pro. Open RStudio.

In RStudio, in the R console, type the following code and press Enter:

install.packages("arcgisbinding", repos="http://r-arcgis.github.io/r-bridge", type="win.binary")
library(arcgisbinding)
arc.check_product()

The arc.check_product() function causes the RStudio Console to print information regarding your ArcGIS product and license.

After the arcgisbinding package has been loaded into the RStudio workspace and the connection from R to ArcGIS has been initialized, data from your current project in ArcGIS can be loaded into the RStudio workspace. Shapefiles, feature classes from geodatabases, and tables are all valid arguments to use in the open function.

Run the arc.open() function as shown in the following code block. For the argument, type the full path to your enriched data subset (Houston_Crimes_Sample_Enrich_Subset) and press Enter.

enrich_df <- arc.open(path = 'E:/Dropbox/ArcGIS/R-ArcGIS Bridge/houston-crime-sample/Houston Crime Analysis.gdb/Houston_Crimes_Sample_Enrich_Subset')

Notice that you may have saved your project data to a different location than shown in the code example; if you're copying and pasting, update the path accordingly. In RStudio paths, you may need to replace each backslash (\) with a double backslash (\\), or simply use forward slashes (/) instead.

The function stores a new arc.dataset class object in the variable enrich_df. This object contains both the spatial information and the attribute information for your ArcGIS data and can now be used in other functions. The variable is listed in RStudio under Data.

With the arc.select() function, you can choose a subset of attributes from the enrich_df object that you want to use as data for your analysis.

Run the arc.select() function as shown in the following code block. For the first argument, put enrich_df as the object from which you're making a subset. For the second argument, add a character vector containing the name of each attribute from your dataset that you want in your subset and press Enter.

enrich_select_df <- arc.select(object = enrich_df, fields = c('OBJECTID', 'SUM_VALUE', 'historicalpopulation_tspop19_cy', 'wealth_medval_cy', 'wealth_medhinc_cy', 'ownerrenter_renter_cy', 'businesses_n13_bus', 'businesses_n37_bus'))

In RStudio, the arc.select() function does not recognize field aliases, so you need to specify actual field names for the subset. The following list shows the actual field names and associated alias names for the Houston_Crimes_Sample_Enrich_Subset feature class.

  • historicalpopulation_tspop19_cy - 2019 Total Population (Esri 2020)
  • wealth_medval_cy - 2020 Median Home Value
  • wealth_medhinc_cy - 2020 Median Household Income
  • ownerrenter_renter_cy - 2020 Renter Occupied HUs
  • businesses_n13_bus - 2020 Food & Beverage Stores Bus (NAICS)
  • businesses_n37_bus - 2020 Food Service/Drinking Estab Bus (NAICS)

Your enrich_select_df variable now contains an R data frame object with the eight attributes you selected from your full original shapefile in R. These attributes include an ID value, the crime counts for each hexagon bin, and the six attributes with which you enriched your data.

Finally, you'll convert your R data frame into a spatial data frame object using the arc.data2sp() function. A spatial data frame object is one of the spatial data classes contained in the sp package. The sp package offers classes and methods for working with spatial data such as points, lines, polygons, pixels, rings, and grids. With this function, you can transfer all of the spatial attributes from your data, including projections, from ArcGIS into R without worrying about a loss of information.

If you've never used the sp package, you need to install the sp package into your RStudio package library, and load the functions from the sp package into your workspace environment.

Run the library() function as shown in the following code block to load the sp package into RStudio.

install.packages("sp")
library(sp)

Run the arc.data2sp() function as shown in the following code block. For the first argument, use the enrich_select_df data frame as the object you are converting to an sp object.

enrich_spdf <- arc.data2sp(enrich_select_df)

Congratulations! Your data has been bridged from ArcGIS Pro to RStudio.

2. Calculate smoothed crime rates

All the information that you need to start your analysis is in place, but you may notice that the current attribute labels are cryptic and abbreviated, as they carry over the original field names from the Houston_Crimes_Sample_Enrich_Subset feature class. In R, you can change the names of the enriched attributes to make them easier to identify. Once that's complete, you'll perform an Empirical Bayes smoothing on the data.

Run the following code to create a character vector called col_names. For each item in the vector, provide the new attribute name with which you want to replace the current label.

col_names <- c("OBJECTID", "Crime_Counts",
               "Population", "Med_HomeValue", "Med_HomeIncome",
               "Renter_Count", "Grocery",
               "Restaurant")

Run the colnames() function as shown in the following code block. For its argument, select the data attribute of your spatial polygons data frame by using enrich_spdf@data. Assign the col_names vector that you created as the attribute label values to be assigned in place of the original variable names.

colnames(enrich_spdf@data) <- col_names

You have updated the column names for the data attribute of your spatial polygons data frame. Optionally, to check your changes, run the following command:

head(enrich_spdf@data)

The first few lines of the data attribute are displayed.

Next, you'll use the EBest() function to perform Empirical Bayes smoothing on your crime rates. The EBest() function is contained in the spdep package. As before, if you've never worked with the spdep package, you'll need to install the package before running the library(spdep) line to load the spdep library into your current workspace.

install.packages("spdep")
library(spdep)

In the console, run the following lines of code to calculate Empirical Bayes smoothed crime rates for each hexagon bin. You can either run each line individually or paste the entire code block and run all the lines at once.

n <- enrich_spdf@data$Crime_Counts
x <- enrich_spdf@data$Population
EB <- EBest(n, x)
p <- EB$raw
b <- attr(EB, "parameters")$b
a <- attr(EB, "parameters")$a
v <- a + (b/x)
v[v < 0] <- b/x[v < 0]
z <- (p - b)/sqrt(v)

The EBest() function performs a particular type of empirical Bayesian estimation. Generally speaking, whereas more traditional Bayesian methods work with priors before any data values have been observed, empirical Bayesian estimation is an approximation of these techniques, since the beta prior distribution used is estimated from the data. However, the EBest() function used in this lesson actually uses a modified version of empirical Bayesian estimation with parameter estimates determined by the method of moments. In the code, the variables a and b refer to the method of moments phi and gamma values, respectively. These values are the estimates of the population parameters (population in this case does not refer to the population value of the hexagon bins, but rather to the concept of sample and population data in statistics) and are what we use to perform the smoothing of the rates.

Smoothing in this case is performed by calculating the standard score, otherwise known as a z-score. The calculation subtracts the population mean (our gamma value, estimated by the method of moments) from each raw crude rate and divides by the standard deviation. The standard deviation is the square root of the variance, which cannot be negative; hence the guard applied to the variable v.

Finally, you'll add your smoothed crime rates as a new attribute to your spatial data frame object.

Run the following line of code to add a new attribute called EB_Rate to your spatial polygons data frame:

enrich_spdf@data$EB_Rate <- z

Your data now contains a new column called EB_Rate that contains the crime rate values that you calculated above for each hexagon bin.

Now that you're done in R, you'll return to ArcGIS to explore your newly created crime rates by mapping and analyzing them.

First, you'll use the arc.sp2data() function to convert your data back from an R spatial data frame object to a data frame that the bridge can write to ArcGIS. You'll then use arc.write() to write your data to your project of choice.

Run the arc.sp2data() function and enter enrich_spdf as its argument. Assign the result to a new variable, such as arcgis_df.

arcgis_df <- arc.sp2data(enrich_spdf)

Your R spatial polygon data frame has now been converted back to a data frame object and can be written to your ArcGIS project. You can write your data frame object back to your ArcGIS project as a shapefile, table, or feature class in a geodatabase.

Run the arc.write() function as shown in the following code block. For the first argument, type the path to your houston-crime.gdb geodatabase and name the feature class Houston_Crimes_Sample_Rates. For the second argument, enter the R object that you want to write to ArcGIS. Finally, add an optional third argument to specify the spatial reference of the object that you're writing to ArcGIS. Note that you may need to replace the path in the example code with the path to your own houston-crime.gdb geodatabase.

arc.write('E:/Dropbox/ArcGIS/R-ArcGIS Bridge/houston-crime-sample/houston-crime.gdb/Houston_Crimes_Sample_Rates', arcgis_df, shape_info = arc.shapeinfo(enrich_df))

The spatial reference ensures that the data is projected correctly when you add it to a map.

Using R, you transformed your data into something powerful. You turned your crime counts for each hexagon bin into crime rates that account for changes in population and showed how that impacts the number of crimes that occur. You used Empirical Bayes smoothing to ensure that the rates you created were robust to the amount of information available for each location.

Next, you'll reexamine your data in ArcGIS Pro so you can visualize the trends and patterns of crime.

3. Continue analysis in ArcGIS Pro

Now that it's time to bring your R-adjusted data back into ArcGIS, you'll minimize RStudio and maximize your ArcGIS project.

Minimize RStudio and maximize ArcGIS Pro.

In the Catalog pane, right-click your houston-crime.gdb geodatabase and choose Refresh.


You can see that the Houston_Crimes_Sample_Rates feature class, which contains the smoothed crime rates, now appears in your houston-crime.gdb geodatabase.

In the houston-crime.gdb geodatabase, right-click the Houston_Crimes_Sample_Rates feature class, point to Add To New, and choose Map.

The file is added to a map in your project and can be used for analysis.

4. Identify areas with unusually high crime rates

To see which areas in Houston have an unexpectedly high number of crimes given the number of people, you'll run another hot spot analysis. Remember that this example uses only 10 days of data, so the results apply only to that short period. To improve the analysis, you would need to enhance your data with more observations covering a longer time span (though, as mentioned before, that consumes ArcGIS credits).

The first hot spot analysis you ran identified areas with statistically significantly higher numbers of crimes than expected and provided information on how the number of crimes at each location was changing over time. This hot spot analysis identifies statistically significant clusters of high and low crime rates, allowing you to see the areas with unusually high numbers of crimes given the population in the area.

In the Geoprocessing pane, search for and open the Optimized Hot Spot Analysis tool. Change the parameters as follows:

  • For Input Features, choose Houston_Crimes_Sample_Rates.
  • For Output Feature Class, browse to Houston Crime Analysis.gdb and name the output feature class Houston_Crime_Sample_Rates_OHSA.
  • For Analysis Field, choose EB_Rate.


Click Run. Because this tool runs on the crime rates that you calculated in R, contained in the EB_Rate column, its results locate the statistically significant spatial clusters of high or low crime rate values.


On your map, the bright red hexagon bins signify areas where there is intense clustering of high values with 99 percent confidence. These are areas where there are unusually high numbers of crimes occurring even when accounting for population. Notice that once the population of each area is considered, there are no statistically significant cold spots (areas of clustering of low crime counts).

Save your project.

You've used the R-ArcGIS bridge to transfer your data into R to take advantage of functionality needed to calculate smoothed crime rates. You also transferred your data back into ArcGIS to continue your analysis and to further pinpoint areas in need of extra police resources to reduce the number of crimes occurring.

Next, you'll perform an exploratory data analysis in R to determine whether the trends in crime rates can be linked to any of the other attributes with which you enriched your data.

Identify attributes that influence crime

Previously, you created a map in ArcGIS Pro showing where there are unusually high numbers of crimes occurring adjusted to account for population. Next, you'll determine which of the other attributes you added may be influencing the prevalence of crime across Houston. To do this, you will use R to leverage exploratory data analysis tools to identify the most significant influencers on crime. The most influential attributes can be used as a starting point for the department when it's ready to begin researching possible predictive models for future planning and the proactive distribution of resources in ArcGIS and R.

1. Create a correlation matrix in R to evaluate attribute relationships

As an important first step in modeling the relationships between crime and the variables you've chosen, you can use exploratory data analysis tools in R. These tools allow you to identify the most likely statistically relevant predictors for your analysis, potentially making future models you build more effective in identifying future trends. The exploratory data analysis tool you're going to use is a correlation matrix. The correlation matrix tool creates an illustrated grid of values that measure the level of association between your added attributes and your population smoothed crime rates.
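Before building the full matrix, it may help to see what a correlation matrix looks like on a small example. This sketch uses R's built-in mtcars data rather than the crime data:

```r
# Pairwise Pearson correlations among three mtcars columns
round(cor(mtcars[, c("mpg", "wt", "hp")]), 2)
```

Each cell is the correlation coefficient for a pair of attributes; the diagonal is always 1, and the matrix repeats each pair twice, once above and once below the diagonal.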

 Open your Houston Crime Analysis project in ArcGIS Pro and open RStudio. You'll bring your most recent version of the data into R.

In RStudio, run arc.open() to establish the project and dataset you are working with by specifying the path to your Houston_Crimes_Sample_Rates feature class as the first argument. Store the result in the rate_df variable.

rate_df <- arc.open('E:/Dropbox/ArcGIS/R-ArcGIS Bridge/houston-crime-sample/houston-crime.gdb/Houston_Crimes_Sample_Rates')

Run arc.select() to choose the variables you want from your data to bring into R. For the first argument, specify the object from which you are selecting attributes, in this case, rate_df. For the second argument, list each attribute that you want to select in a character vector.

rate_select_df <- arc.select(rate_df, fields = c("OBJECTID", "Crime_Counts", "Population", "Med_HomeValue", "Med_HomeIncome", "Renter_Count", "Grocery", "Restaurant", "EB_Rate"))

Run the arc.data2sp() function to convert your feature class into a spatial data frame object.

rate_spdf <- arc.data2sp(rate_select_df)

To enhance the appearance of the correlation matrix, you'll load several function libraries and custom functions that can be found in the correlation matrix tutorial. You'll use the following custom functions:

  • Get lower triangle of the correlation matrix - By default, a correlation matrix returns the correlation coefficient for each pair of attributes twice. This function identifies and returns only the values for the lower triangle of your correlation matrix. All other values are set to NA.
  • Get upper triangle of the correlation matrix - By default, a correlation matrix returns the correlation coefficient for each pair of attributes twice. This function identifies and returns only the values for the upper triangle of your correlation matrix. All other values are set to NA.
  • Reorder correlation coefficients - This function reorders the correlation matrix by correlation coefficient magnitude.

These functions produce a correlation matrix that is polished, easier to analyze, and ready to be shared with the police department.

Run the following code in your RStudio console to add the custom functions to your workspace:

# Get lower triangle of the correlation matrix
get_lower_tri <- function(cormat) {
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
#
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat) {
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
#
# Reorder correlation coefficients
reorder_cormat <- function(cormat) {
  # Use correlation between variables as distance
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}
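To see what these helpers do, you can try get_upper_tri() on a small matrix (the function is repeated here so the snippet runs on its own, and mtcars stands in for the crime data):

```r
# Same helper as above, repeated so this demo is self-contained
get_upper_tri <- function(cormat) {
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}

m <- round(cor(mtcars[, c("mpg", "wt", "hp")]), 2)
get_upper_tri(m)  # entries below the diagonal are now NA
```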

Run the following code to add the functions from the reshape2, ggplot2, and ggmap libraries to your workspace.

install.packages("reshape2")
library(reshape2)
install.packages("ggplot2")
library(ggplot2)
install.packages("ggmap")
library(ggmap)

Run the following code to create the correlation matrix:

corr_sub <- rate_spdf@data[c("Grocery", "Restaurant", "Med_HomeIncome", "Renter_Count", "Med_HomeValue", "EB_Rate")]
cormax <- round(cor(corr_sub), 2)
upper_tri <- get_upper_tri(cormax)
melted_cormax <- melt(upper_tri, na.rm = TRUE)
cormax <- reorder_cormat(cormax)
upper_tri <- get_upper_tri(cormax)
melted_cormax <- melt(upper_tri, na.rm = TRUE)
ggheatmap <- ggplot(melted_cormax, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), space = "Lab", name = "Pearson\nCorrelation") +
  theme_minimal() + # minimal theme
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  coord_fixed()
print(ggheatmap)
ggheatmap +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    axis.ticks = element_blank(),
    legend.justification = c(1, 0),
    legend.position = c(0.6, 0.7),
    legend.direction = "horizontal") +
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1, title.position = "top", title.hjust = 0.5))

In the Plots pane, click Zoom to view the resulting correlation matrix.

A correlation matrix helps identify predictors you may want to focus on when deciding what attributes influence the occurrence of crime. The matrix shows values that measure how strongly correlated the attributes are with the desired dependent variable and with one another. Additionally, by identifying attributes with a higher correlation to your dependent variable, you have a better idea of what predictors to try first when attempting to find a predictive model that fits your data well. Your correlation matrix can also be used to identify possible instances of multicollinearity between predictors.

When a predictor has a positive correlation with your dependent variable, it means that as the magnitude of the predictor increases, the magnitude of the dependent variable increases as well. The higher the correlation coefficient, the stronger this relationship is. With a negative correlation, by contrast, as the magnitude of the predictor increases, the magnitude of the dependent variable decreases.

Multicollinearity measures how similar two predictors are. When a predictive model contains variables that represent the same effect, those variables tend to negate one another to the point where the effect being accounted for does not impact the response as strongly, and they can cause instability in the model. Both of these measures are valuable to consider as you investigate what influences the occurrence of high crime rates.
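Both ideas can be illustrated with simulated data (the variables below are invented for the demonstration and are unrelated to the crime data):

```r
set.seed(42)
x <- 1:100
y_up   <- 2 * x + rnorm(100, sd = 15)   # rises with x: positive correlation
y_down <- -2 * x + rnorm(100, sd = 15)  # falls with x: negative correlation
cor(x, y_up)    # near +1
cor(x, y_down)  # near -1

# Two predictors built from the same underlying signal are nearly collinear
p1 <- x + rnorm(100, sd = 5)
p2 <- x + rnorm(100, sd = 5)
cor(p1, p2)     # near +1: a multicollinearity warning sign
```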

The cells above and to the right of your EB_Rate variable have fairly light colors and low values, indicating that it is not strongly correlated with the other attributes. In contrast, you can observe some possible instances of multicollinearity among your potential predictors. In particular, the Restaurant and Grocery attributes are strongly correlated, as indicated by the bright red square with a correlation coefficient of 0.71.

While the correlation matrix can offer clues about which predictors to include in your model, this type of exploratory data analysis is just one part of building a good model. For the purposes of this lesson, you'll use only the correlation matrix to help make decisions. For more information about other exploratory functions in R that are useful for developing predictive models, introductory blog posts on exploratory data analysis in R are a good starting point.

 

Conclusion:

Based on the results of your correlation matrix, it appears that your department still has some work to do in terms of identifying attributes that influence the occurrence of crime. Ideally, you'd like to see attributes much more strongly correlated with your response variable, so that you can be fairly confident a relationship exists between the two. However, this is not surprising, as finding good predictors in a given data set for a particular response variable can be tricky and can require advanced statistical methods to account for possible non-linear terms, spatial trends, and other factors.

Through this lesson, you learned how to install and set up the R-ArcGIS bridge, how to use the bridge to transfer data between ArcGIS and R, and you have seen one of the possible ways R can enhance your ArcGIS workflows through its powerful statistical libraries. You've analyzed statistically significant spatiotemporal trends; enriched your data with a wealth of available socioeconomic, demographic, business, and environmental factors; calculated robust crime rates; found hot spots of crime rates; and begun to explore relationships that might help explain those patterns. With the addition of the R-ArcGIS bridge to your workflow, you now have more possibilities at your fingertips than ever before to assist you and your department as you work to understand crimes in Houston and what you can do to reduce them.

 
