List of United States Cities by Population

ajiang·January 22, 2025

Description

Cleaned dataset of the list of US cities by population

Summary


The data in this dataset was collected from the Wikipedia article "List of United States cities by population" as of 10/10/2023. It was gathered by web scraping with Python and the Beautiful Soup library, and Pandas was used to convert the scraped data into a final data frame that was then exported to .csv format. The column names follow the Wikipedia article, adjusted to include units, and the column data was collected by parsing the Wikipedia table's HTML to find the correct tags for table data.
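The scraping step described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the inline `SAMPLE_HTML`, the `wikitable` class lookup, the helper name `table_to_frame`, and the column names and values are all hypothetical placeholders standing in for the real Wikipedia page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table's HTML.
# The rows are placeholder values, not real census data.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>City</th><th>2020 census</th><th>2020 land area</th></tr>
  <tr><td>City A</td><td>1,000,000</td><td>300.5 sq mi</td></tr>
  <tr><td>City B</td><td>500,000</td><td>150.2 sq mi</td></tr>
</table>
"""


def table_to_frame(html: str) -> pd.DataFrame:
    """Parse the first 'wikitable' in the HTML into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = table.find_all("tr")
    # Header cells live in <th> tags, data cells in <td> tags.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=headers)


df = table_to_frame(SAMPLE_HTML)
# With no path argument, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)
```

In a real run the HTML would come from the live page rather than an inline string, and `to_csv` would be given a file path.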

Processing was done in Python and included deleting duplicate rows and NULL entries, as well as cleaning with the .replace command to remove units (square miles, square kilometers), percent signs and brackets from the table data.
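A cleaning pass along these lines might look like the sketch below. The frame `raw`, its column names, and the helper `clean` are assumptions made for illustration, and the values are placeholders; the original script's exact calls are not shown in the source.

```python
import pandas as pd

# Hypothetical scraped rows: one duplicate, one row with a missing city,
# unit suffixes, percent signs, and a bracketed footnote marker.
raw = pd.DataFrame({
    "City": ["City A", "City A", "City B", None],
    "2020 land area": ["300.5 sq mi", "300.5 sq mi", "150.2 sq mi", "10.0 sq mi"],
    "Change": ["+1.5%", "+1.5%", "-2.0%[a]", "0.0%"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate and NULL rows, then strip units, '%' and brackets."""
    out = frame.drop_duplicates().dropna().copy()
    # Remove the unit suffix so the column can be stored as a number.
    out["2020 land area"] = (
        out["2020 land area"].str.replace(" sq mi", "", regex=False).astype(float)
    )
    # Remove bracketed footnote markers first, then the percent sign.
    out["Change"] = (
        out["Change"]
        .str.replace(r"\[.*?\]", "", regex=True)
        .str.replace("%", "", regex=False)
        .astype(float)
    )
    return out.reset_index(drop=True)


cleaned = clean(raw)
```

Converting to `float` after stripping the text is a design choice of this sketch; it makes the columns usable for the statistical analysis that follows.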

Analysis used the SciPy and Pyplot Python libraries for statistical analysis and data visualization, respectively. SciPy was used to calculate a correlation coefficient for two variables: 2020 Census and 2020 Land Area. A scatterplot of those two variables, generated with Pyplot, displayed the weak correlation between them. A boxplot was also generated for the 2020 Census population data to analyze its distribution, and summary statistics were generated.
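The analysis step could be reproduced with something like the following sketch. The source does not say which correlation coefficient was used, so Pearson's r via `scipy.stats.pearsonr` is an assumption here; the column names and the five data points are placeholders, not values from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Placeholder data standing in for the cleaned dataset.
df = pd.DataFrame({
    "population": [100, 200, 300, 400, 500],
    "land_area": [10.0, 25.0, 20.0, 45.0, 40.0],
})

# Assumed choice: Pearson correlation between population and land area.
r, p_value = stats.pearsonr(df["population"], df["land_area"])

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatterplot of the two variables.
ax_scatter.scatter(df["land_area"], df["population"])
ax_scatter.set_xlabel("2020 land area")
ax_scatter.set_ylabel("2020 census population")

# Boxplot of the population distribution.
ax_box.boxplot(df["population"])
ax_box.set_ylabel("2020 census population")

# Count, mean, standard deviation, min, quartiles and max.
summary = df["population"].describe()
```

Calling `fig.savefig(...)` or switching to an interactive backend and `plt.show()` would render the two plots.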

Basic info
Author: ajiang
Shared with: Everyone
Created: October 10, 2023
Size: 51 KB
License: N/A
Dictionary: 1 table
