Conducting Geospatial Research: Organizing Databases, Documentation, and Crafting Sample SQL Queries

Geospatial Relational Databases, SQL, and Exploratory Analysis

Geospatial relational databases play a vital role in managing and analyzing geographic information and related tabular data. These databases are more than just repositories; they offer valuable tools and frameworks for organizing GIS projects, conducting research, and performing analysis. A well-structured database facilitates efficient data exploration, empowering analysts to query and investigate the information effectively. While geospatial databases are typically employed within desktop GIS software, they can also be leveraged through SQL for robust data processing and analysis. This adaptability makes geospatial relational databases crucial in geospatial research and analysis. Furthermore, geospatial relational databases provide a solid foundation for spatial data integration and interoperability. These databases allow for the seamless integration of diverse data sources. By consolidating all this information into a single database, analysts can gain a comprehensive view of the geographical landscape they are studying and make informed decisions based on evidence-driven insights.

The Importance of Congruous Data for Seamless Configuration

Ensuring the data used in a geospatial relational database is able to be connected to other data added to the database is of utmost importance for performing SQL queries. Geospatial databases store and manage location-based information, such as coordinates, addresses, and spatial relationships. In order to effectively analyze and manipulate this data, it is crucial that it can be seamlessly integrated with other data within the database. This allows for complex queries that combine geospatial information with other attributes or criteria. By establishing these connections, users can extract valuable insights and make informed decisions based on their geospatial data. Without proper data connectivity, the potential of a geospatial relational database may be limited, hindering its ability to deliver comprehensive and accurate results.

Constructing the Relational Database

CalEnviroScreen 3.0

For this project, we will source our initial dataset from the California Office of Environmental Health Hazard Assessment's CalEnviroScreen 3.0 (CES). This dataset provides valuable information on environmental factors and their potential health impacts within California communities.

California Schools 2020-21

The second data set added to the database was sourced from the California State Geoportal (Dixon, 2022) and hosts information related to California schools from the 2020-2021 academic year. This data set is an advantageous addition as it contains relevant data missing from the school data already present in the CES data. Additionally, this data requires no preprocessing for use in the relational database.

Hazardous Waste Sites

The third data set added to the database was sourced from the California State Geoportal (DTSC Admin, 2022) and hosts information related to hazardous waste sites located in California. This data set provides expansion upon the limited data for hazardous waste sites already present in the CES data and requires no data processing prior to implementation.

Additional Tables Included in Database

  • Cities and Towns

  • Hospitals

  • Schools

  • Colleges

Data Codebook

Developing Sample SQL Queries

Sample SQL queries play a crucial role in the design of databases that will be used by others. These queries provide a practical demonstration of how data can be retrieved, manipulated, and analyzed from a database. By including sample SQL queries, designers ensure that end-users understand the database's structure, its data organization, and the available functionalities. Additionally, these queries allow designers to verify the accuracy and efficiency of their database design. They help uncover potential issues, such as incorrect joins or inefficient indexing, early in the development process. Furthermore, sample SQL queries can serve as powerful learning tools for users who are new to the database, enabling them to better understand the underlying data and its relationships. Ultimately, the inclusion of well-crafted sample SQL queries enhances the usability, efficiency, and overall quality of databases, providing users with a reliable and efficient tool for their data management needs.

The easiest approach to developing sample SQL queries is to first identify realistic research or analysis questions database users would seek to resolve. Next, outline the tables and fields they would use to perform such a query. Lastly, provide an example SQL query.

Example SQL Research/Analysis Questions

Note. Tables are referred to by their database assigned names. California Schools is school2021, CalEnviroScreen is ces_tracts, and Hazardous Waste Sites is hazardouswastesites.

Which California schools have 500 students or more and are within 5 miles (8046.72 m) of a hazardous waste site?

  • Table Fields --------------------------------- ------------------------------------

    California Schools schoolname, city, geom, enrolltota

    Hazardous Waste Sites facility_n, geom

  • select schoolname school_name, school2021.city school_city, school2021.geom school_location, enrolltota school_enrollment_number, facility_n hazardous_waste_site_name

    from school2021, hazardouswastesites

    where st_intersects(school2021.geom, st_buffer(hazardouswastesites.geom, 8046.72)) and school2021.enrolltota >= 500;

How many hospitals are within a census tract that has an age-adjusted rate of ED visits for asthma per 10,000 of 61.0 or higher and toxicity-weighted concentration value of 500.00 or higher?

  • Table Fields --------------------------------- ----------------------------------------------

    Hospitals geom

    ces_tracts geom, asthma, rseihaz

  • Select count(*) hospitals from hospitals, ces_tracts

    where st_intersects(st_transform(hospitals.geom, 3310), ces_tracts.geom) and ces_tracts.asthma >= 61.0 and ces_tracts.rseihaz >= 500.00;

Which California census tracts have equal to or over 12.0 annual mean PM2.5 concentrations and at least one hazardous waste site? What city and county does each tract reside in? What was the population size of these tracts in 2010?

  • Table Fields --------------------------------- ----------------------------------------------

    hazardouswastesites geom

    ces_tracts geom, asthma, rseihaz

  • select ces_tracts.tract, ces_tracts.city, ces_tracts.california_county county, pm PM_Conc, pop2010 Population_2010

    from ces_tracts, hazardouswastesites

    where st_intersects(ces_tracts.geom, st_transform(hazardouswastesites.geom, 3310)) and pm >= 12.0

    order by pm desc;

How many California hazardous waste sites reside within census tracts where 40% or less of the population is ethnically classified as White and the linguistic isolation is 12.0% or higher?

  • Table Fields --------------------------------- ----------------------------------------------

    hazardouswastesites geom

    cestracts geom, white_pct, ling

  • select count(*) Num_of_Sites

    from hazardouswastesites, ces_tracts

    where st_intersects(st_transform(hazardouswastesites.geom,3310), ces_tracts.geom) and white_pct <= 40.0 and ling >= 12.0;

What is the percentage of California that is within 16093 meters of a college?

  • Table Fields --------------------------------- ----------------------------------------------

    colleges geom

    cestracts geom

  • select 100 * st_area(st_union(st_buffer(geom,16093))) / (select sum(st_area(geom)) from ces_tracts) pct_CA_wi_3100_m_of_college

    from colleges;

How many California elementary schools have two or more students who are classified as homeless?

  • Table Fields --------------------------------- ----------------------------------------------

    school2021 schooltype, homcount

  • select count(*) from school2021 where schooltype = 'Elementary' and homcount >= 2;

How many California colleges are in a tracts with a cleanup site sum equal to or greater than 13.1?

  • Table Fields --------------------------------- ----------------------------------------------

    colleges geom

    cestracts geom, cleanups

  • select count(*) Colleges from colleges, ces_tracts

    where st_intersects(st_transform(colleges.geom, 3310), ces_tracts.geom) and ces_tracts.cleanups >= 13.1;

Which tracts have diesel PM emissions greater than 0.20 and hospitals with less than 200 beds?

  • Table Fields --------------------------------- ----------------------------------------------

    ces_tracts geom, pm

    hospitals geom, beds

  • select tract, pm, beds from ces_tracts, hospitals

    where st_intersects(st_transform(hospitals.geom, 3310), ces_tracts.geom) and ces_tracts.pm > 0.20 and hospitals.beds < 200;

Which tracts have ozone values greater than 0.059 and at least one hazardous waste site?

  • Table Fields --------------------------------- ----------------------------------------------

    hazardouswastesties geom

    cestracts geom, ozone

  • select tract, ozone ozone_values from ces_tracts, hazardouswastesites

    where st_intersects(st_transform(hazardouswastesites.geom, 3310), ces_tracts.geom) and ces_tracts.ozone > 0.059;

What percentage of schools with a homeless percentage above 2% reside within a census tract with a poverty level of 50% or greater and a pollution burden score of 5.0 or higher?

  • Table Fields --------------------------------- ----------------------------------------------

    California Schools geom, homepct

    CalEnviroScreen geom, tract, pov, pollutions

  • Select 100 * count(*)::numeric(15,1) / (select count(id) from school2021) as pct_of_schools from school2021, ces_tracts

    where st_intersects(st_transform(school2021.geom,3310), ces_tracts.geom) and hompct >= 2.0 and ces_tracts.pov >= 50.0 and ces_tracts.pollutionscore >= 5.0;

What cities/towns have 30.0% or higher of the population over 25 with less than a high school education and how many community colleges are in these cities/towns to allow for access to continue or complete education?

  • Table Fields --------------------------------- ----------------------------------------------

    cities and towns geom, name

    ces_tracts geom, edu

    colleges geom, name, naics_desc

  • select cities_towns.name City_Name, ces_tracts.edu Education_Attainment_Pct, (select count(*) from colleges where colleges.lcity = cities_towns.name) Community_Colleges

    from cities_towns, ces_tracts, colleges

    where st_intersects(cities_towns.geom, ces_tracts.geom) and ces_tracts.edu >= 30.0

    group by cities_towns.name, ces_tracts.edu;

Which California colleges are within a census tract with drinking water contaminant index value of 642.0 or greater and a groundwater threat score of 22.40 or higher? What city is each college in?

  • Table Fields --------------------------------- ----------------------------------------------

    ces_tracts geom, drink, gwtheats

    colleges geom, name, lcity

  • select colleges.name College, colleges.lcity City, ces_tracts.drink Drink_Contam_Index, ces_tracts.gwthreats Ground_Water_Threat

    from ces_tracts, colleges

    where st_intersects(st_transform(colleges.geom,3310), ces_tracts.geom) and ces_tracts.drink >= 516.0 and gwthreats >= 16.30

    order by colleges.name asc;

What is the total area of california census tracts which contain a population greater than 10,000 and contain 1 or more hospitals?

  • Table Fields --------------------------------- ----------------------------------------------

    ces_tracts geom, pop2010

    hospitals geom

  • select sum(st_area(ces_tracts.geom))

    from ces_tracts, hospitals

    where st_intersects(ces_tracts.geom, st_transform(hospitals.geom,3310)) and ces_tracts.pop2010 > 10000;

Which ten California schools are subjected to the highest pollution burden scores (9.0 or higher)?

  • Table Fields ------------------------------- ----------------------------------------------

    ces_tracts geom, pollutions

    school2021 geom, schoolname, city, enrolltota

  • select schoolname school_name, school2021.city school_city, enrolltota school_enrollment_number, ces_tracts.pollutionscore

    from school2021, ces_tracts

    where st_intersects(st_transform(school2021.geom,3310), ces_tracts.geom) and ces_tracts.pollutionscore >= 6.0

    limit 10;

Which tract has the highest pollution score and at least one hazardous waste site?

  • Table Fields --------------------------------- ----------------------------------------------

    hazardouswastesties geom

    cestracts geom, pollutions

  • select ces_tracts.tract, ces_tracts.pollutionscore

    from ces_tracts, hazardouswastesites

    where st_intersects(st_transform(hazardouswastesites.geom,3310), ces_tracts.geom)

    order by ces_tracts.pollutionscore

    limit 1;

References

CES. (2016, December 29). CalEnviroScreen 3.0 [Text]. OEHHA. https://oehha.ca.gov/calenviroscreen/report/calenviroscreen-30

Dixon, G. (2021). California Schools 2020-21 | California Schools 2020-21. California State Geoportal. https://gis.data.ca.gov/datasets/CDEGIS::california-schools-2020-21/explore?location=36.763236%2C-119.099288%2C7.00

DTSC Admin. (2022). Hazardouswastesites. California State Geoportal. https://gis.data.ca.gov/maps/DTSC::hazardouswastesites

Previous
Previous

Designing and Documenting a Geospatial Database: LA County Crime

Next
Next

Texas Community Assessment Database