Posts tagged: data science

The Data Doesn’t Lie: How 266 Runs Revealed The Truth About My Marathon Goal

For the BMW Dallas Marathon, I'm targeting 7:30/mile pace, a 3:16:45 finish. What training plan should I follow from today? Work has been insanely stressful and has destroyed my training. Can AI help me here?

My long runs felt good at that pace for 8-10 miles, so I thought I was on track. I'm always testing how much I can do myself, and I was curious to use AI to coach me. AI isn't useful without the right data, so I analyzed 266 runs from the past two years to build a good training plan.

The most recent data matters most, so this morning's run deserved close attention. I ran a half marathon at 8:10 average pace with a heart rate of 137-148 bpm, a very easy aerobic effort, and finished the last two miles at 7:42 and 7:49 pace.

Here’s what 266 runs over two years revealed:

| Distance Range | Runs | Average Pace | Best Pace | Pattern |
|---|---|---|---|---|
| 3-4 miles | 92 | 8:14 | 6:44 | Speed is there! |
| 6-7 miles | 31 | 8:25 | 7:23 | Solid training pace |
| 10-11 miles | 8 | 8:11 | 7:47 | Sub-8:00 capability proven |
| 13-14 miles | 3 | 7:48 | 7:53 | THE SWEET SPOT |
| 14-15 miles | 2 | 7:54 | 7:41 | Strong mid-distance |
| 16-17 miles | 2 | 8:26 | 8:20 | Starting to fade |
| 18-19 miles | 2 | 8:11 | 7:44 | Inconsistent |
| 20+ miles | 5 | 8:54 | 8:00 | THE PROBLEM |

The pattern was clear: 13-14 mile average of 7:48 versus 20+ mile average of 8:32. A 44-second-per-mile dropoff. My best 20+ mile run: 8:00 pace. Still 30 seconds slower than goal. My average 20+ mile run: 8:32 pace. A 1:02/mile gap to close. But this morning’s half marathon told a different story. I ran 13.1 miles at low heart rate (137-148 bpm) and finished strong at 7:42-7:49 pace. “Very easy,” I noted afterward. “Could have done a lot more.” This suggests my aerobic base is much better than the historical 8:32 suggests. That average likely came from poor pacing on long runs, not lack of fitness.

My 3-4 mile best pace: 6:26. That’s 1:04 faster than my marathon goal. The problem isn’t speed—it’s extending that speed over distance. The gap: extending my 7:48 pace from 13-14 miles to 20+ miles, then racing smart for 26.2 miles. When you define it that specifically, you can build a plan to address it.
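Those gaps are simple pace arithmetic. A quick sketch (a hypothetical helper, not part of the original analysis) confirms the numbers:

```python
def pace_to_seconds(pace: str) -> int:
    """Convert an m:ss per-mile pace string to seconds per mile."""
    minutes, seconds = pace.split(":")
    return int(minutes) * 60 + int(seconds)

# Dropoff from the 13-14 mile average to the 20+ mile average
dropoff = pace_to_seconds("8:32") - pace_to_seconds("7:48")

# Speed reserve: best 3-4 mile pace versus the 7:30 marathon goal
reserve = pace_to_seconds("7:30") - pace_to_seconds("6:26")

print(dropoff, reserve)  # 44 and 64 seconds per mile (1:04)
```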

The dropoff from 7:48 pace (13-14 miles) to 8:32 pace (20+ miles) isn’t random—it’s physiological. Research on elite marathoners shows that even well-trained runners deplete muscle glycogen stores significantly after 90-120 minutes at marathon pace. For me, 13-14 miles at 7:48 takes about 1:47—right in the window where glycogen runs low. When I push to 20 miles without proper fueling and without training my body to use fat efficiently, I hit the metabolic wall. But fuel isn’t the only issue. My legs simply aren’t conditioned for the distance.

By “legs aren’t ready,” I mean muscular endurance breaks down, neuromuscular efficiency degrades, and form deteriorates. The quads that fire smoothly at mile 10 are misfiring by mile 18. The signal from brain to muscle becomes less crisp. Motor unit recruitment—the coordinated firing of muscle fibers—gets sloppy. I’m sending the same “run at 7:45 pace” command, but my legs execute it as 8:30 pace. Meanwhile, small biomechanical breakdowns compound: hip drops slightly, stride shortens, each foot strike becomes less efficient. Running 20 miles means roughly 30,000 foot strikes. If I haven’t progressively trained my legs to absorb that cumulative pounding, my body literally slows me down to prevent injury.

Studies on elite marathon training show successful marathoners spend 74% of their training volume at easy intensity (Zone 1-2) because it builds aerobic capacity without accumulating neuromuscular fatigue. My data suggests I was probably running too many miles too hard, accumulating fatigue faster than I could recover—especially at 48. Research on masters athletes shows recovery takes 10-20% longer after age 40. Some coaches recommend 10-12 day training cycles for older athletes instead of traditional 7-day cycles, allowing more space between hard efforts. If I’m not recovering fully before the next quality session, my 20+ mile pace suffers even more than it would for a younger runner.

There’s also cardiovascular drift to consider. During prolonged running, cardiac output gradually decreases while heart rate increases to compensate. This is more pronounced at higher intensities. If I’m running long runs at or near race pace (7:30-7:50), I’m experiencing significant cardiovascular drift by mile 15-18. The effort to maintain pace increases exponentially. My 20+ mile pace of 8:32 might simply reflect the point where my cardiovascular system says “enough.”

My training strategy alternates run days with bike days—typically 18-28 miles at 15-18 mph. Research shows that cycling can maintain aerobic fitness while reducing impact stress, with one mile of running equaling approximately three miles of cycling for cardiovascular equivalence. This means my weekly aerobic load is higher than running mileage suggests: 45 miles running plus 85 miles cycling equals roughly 73 “running equivalent” miles per week. The cycling protects my legs while maintaining cardiovascular fitness—smart training at 48. But it also means my running-specific adaptations (neuromuscular patterns, impact tolerance, glycogen depletion management) might be underdeveloped relative to my aerobic capacity.
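Using the rough 3:1 cycling-to-running equivalence cited above (an approximation from the research, not a personal measurement), the weekly load works out like this:

```python
# Weekly volumes from the paragraph above
run_miles = 45
bike_miles = 85

# Roughly three cycling miles equal one running mile
# for cardiovascular load, per the cited equivalence
equivalent_miles = run_miles + bike_miles / 3
print(round(equivalent_miles))  # ~73 running-equivalent miles per week
```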

The data reveals a specific problem: I have speed (6:26 for 3-4 miles), good mid-distance endurance (7:48 for 13-14 miles), and strong aerobic fitness (cycling adds volume). But I haven’t trained the specific adaptation of holding sub-8:00 pace beyond 14 miles. This isn’t a fitness problem—it’s a specificity problem. The solution isn’t to run more miles. It’s to progressively extend the distance at which I can hold my proven 7:48 pace, while managing fatigue and recovery as a 48-year-old athlete.

Traditional marathon training plans prescribe long runs at “easy pace” or 30-60 seconds slower than race pace. That builds aerobic base, but it doesn’t address my specific limitation: I need to teach my body to hold quality pace for progressively longer distances.

This morning’s half marathon changes the starting point. Instead of beginning conservatively at 16 miles, I can start more aggressively at 18 miles next Saturday. The plan builds from there: 18 miles at 7:50 average pace (two miles easy warmup, 14 miles at 7:45-7:50, two miles easy cooldown). The goal is simple—extend this morning’s easy 13.1-mile effort to 14 quality miles. Week two pushes to 20 miles at 7:45 average, tackling my historical problem distance at a pace 47 seconds per mile faster than my 8:32 average. Week three peaks at 22 miles averaging 7:40 pace—the breakthrough workout that proves I can hold sub-7:40 pace for 18 continuous miles.

After peaking, the volume drops but the intensity holds. Week four: 16 miles at 7:35 average with 12 miles at race pace (7:30-7:35). Week five adjusts for Thanksgiving with 14 miles at 7:35. Week six is the dress rehearsal: 10 miles with six at 7:25-7:30 pace, confirming goal pace is ready. The progression is deliberate—each week either extends distance or drops pace by five seconds per mile, allowing physiological adaptation without overwhelming the system. Elite marathon training research supports this approach: progressive overload with strategic recovery.
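The long-run progression can also be written down as data, which makes it easy to sanity-check durations. This is a sketch using the weekly average paces above (week six has no stated overall average, so it is omitted):

```python
# (week, miles, average pace per mile) from the plan above
plan = [
    (1, 18, "7:50"),
    (2, 20, "7:45"),
    (3, 22, "7:40"),
    (4, 16, "7:35"),
    (5, 14, "7:35"),
]

def duration(miles: int, pace: str) -> str:
    """Total time for `miles` at `pace` (m:ss per mile), as h:mm:ss."""
    m, s = map(int, pace.split(":"))
    total = miles * (m * 60 + s)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"

for week, miles, pace in plan:
    print(f"Week {week}: {miles} miles @ {pace}/mile -> {duration(miles, pace)}")
```

The week-three peak, for instance, is nearly a three-hour effort even at 7:40 pace, which is why the volume drops afterward.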

Tuesday speedwork leverages my natural speed. My best 3-4 mile pace is 6:26—more than a minute per mile faster than marathon goal pace. Research consistently shows that running intervals 30-60 seconds faster than marathon pace improves race performance by increasing VO2 max, improving running economy, and creating a “speed reserve” that makes race pace feel controlled.

The plan starts with 8×800 meters at 6:30 pace (3:15 per repeat) with 90-second recovery jogs—establishing that I can run fast repeatedly. Week two builds to 10×800 at the same pace. Week three shifts to marathon-specific longer intervals: 6×1200 meters at 6:40 pace. Week four is a six-mile tempo run at 7:10-7:15 pace—faster than race pace, sustained effort. The final speedwork comes in week five: 6×800 at 6:25 pace for sharpness, not volume. Running 6:30 pace in workouts creates a one-minute-per-mile speed reserve over my 7:30 goal. Analysis of 92 sub-elite marathon training plans found successful programs include 5-15% high-intensity training. My Tuesday sessions provide exactly this stimulus.
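The 3:15-per-repeat figure comes from treating 800 meters as half a mile, the usual track shorthand (a sketch, not the author's workout code; the exact metric conversion gives closer to 3:14):

```python
# 6:30 per mile, in seconds
pace_per_mile = 6 * 60 + 30

# Track shorthand: an 800 m repeat is roughly half a mile
repeat_800 = pace_per_mile * 0.5
print(f"{int(repeat_800) // 60}:{int(repeat_800) % 60:02d}")  # 3:15
```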

At 48, recovery determines whether I arrive at race day peaked or exhausted. Complete rest days come every five to six days—no running, no cross-training, just rest. Easy runs stay at 6-7 miles at 8:20-8:30 pace, conversational effort that builds aerobic capacity without adding fatigue. Bike days alternate with run days: recovery rides of 18-22 miles at 15 mph with high cadence and low resistance, or moderate rides of 22-28 miles at 16 mph for steady aerobic work. The cycling maintains cardiovascular fitness and increases blood flow to running muscles while reducing impact stress. Research on masters runners consistently emphasizes that recovery adaptations—not just training adaptations—determine race day performance for athletes over 40.

Fueling practice matters as much as the miles themselves. My data shows pace dropping significantly after 90-120 minutes, suggesting glycogen depletion. Case studies on elite marathoners found optimal race-day fueling to be 60 grams of carbohydrate per hour, delivered as 15 grams every 15 minutes in a 10% carbohydrate solution. I need to practice this in training, not just on race day. Every long run over 90 minutes becomes a fueling rehearsal at goal pace.

This morning’s half marathon rewrites the race plan. I ran 13.1 miles at low heart rate (137-148 bpm), averaging 8:10 for the first 11 miles before closing the last two at 7:42 and 7:49 pace. My note afterward: “very easy, could have done a lot more.” That performance suggests my aerobic base is significantly better than my historical 8:32 average for 20+ miles indicates. That average likely came from poor pacing on long runs, not lack of fitness.

The conservative approach mirrors this morning's pattern: controlled start, gradual build, strong finish. Miles 1-10 at 7:30-7:35 pace, just like this morning's easy start, for a predicted split of 1:15:00 to 1:15:50. It will feel easy, almost too easy. The temptation will be to push harder. Don't. Miles 11-20 settle into goal-pace range at 7:28-7:32, a predicted split of 1:14:40 to 1:15:20. My data proves I can hold 7:48 pace for 13-14 miles, and 7:28-7:32 is only 16-20 seconds per mile faster, controlled and sustainable. Miles 21-26.2 finish strong at 7:25-7:30 pace, just like this morning's close, a predicted split of 46:00 to 46:30 for the final 6.2 miles. Total projected finish: roughly 3:15:40 to 3:17:40, centered on the 3:16:45 goal.

The aggressive approach—even 7:30 splits from mile one for a 3:16:45 finish—only makes sense if week three’s 22-mile peak run at 7:40 average feels as easy as this morning’s half marathon felt, if weather is perfect (50-55°F, low humidity, no wind), if taper goes flawlessly, and if I wake up on race day feeling exceptional. Otherwise, the conservative negative split strategy is smarter.

Three scenarios, all representing massive improvement over historical data. Best case: 3:15:00-3:16:00, requiring perfect execution of the conservative strategy and this morning’s negative split pattern. Probability: 45 percent, up significantly after this morning’s performance. Realistic: 3:16:00-3:17:00, solid execution with maybe imperfect conditions. Probability: 40 percent. Solid: 3:17:00-3:18:30, good race with slight fade or challenging conditions. Probability: 15 percent.

All three outcomes crush the historical 8:32 pace for 20+ miles. All three are victories. The goal isn’t to cling to 7:30 pace at all costs—it’s to run the smartest race possible given training data and this morning’s proof that the aerobic base is there.

I thought I was running long runs at 7:30 pace. The data showed 7:48 for 13-14 miles and 8:32 for 20+ miles. Memory is selective. Data isn’t.

But this morning’s half marathon revealed something the historical data missed: I ran 13.1 miles at a heart rate of 137-148 bpm—easy aerobic effort—and finished strong at 7:42 and 7:49 pace. Afterward, I noted “very easy, could have done a lot more.” That 8:32 average for 20+ miles wasn’t about fitness—it was about pacing. I’d been going out too hard and fading. The aerobic base is better than the numbers suggested.

The limitation isn’t speed—my best 3-4 mile pace is 6:26. It’s not aerobic fitness—the cycling adds significant volume and this morning proved the engine is strong. The gap is specificity: I haven’t trained to hold quality pace beyond 14 miles. At 48, I need more recovery than I did at 35. Research shows masters athletes need 10-20% more recovery time. The alternating run-bike schedule isn’t a compromise—it’s smart training that keeps me healthy enough to execute the progressive long runs that will close the gap.

Seven weeks to race day. Progressive long runs build from 18 to 22 miles at progressively faster paces. Tuesday speedwork at 6:30 pace creates a one-minute-per-mile reserve over goal pace. Complete rest every five to six days. Race strategy mirrors this morning’s pattern: controlled start, build into goal pace, finish strong.

Will I hit 3:16:45? Good chance—this morning proved the base is there. Will I run 3:16:00-3:17:00? More likely. Either way, it’s significantly faster than 8:32. The data showed the problem. This morning showed the solution. Now execute.

RESEARCH REFERENCES

  1. Stellingwerff, T. (2012). "Case Study: Nutrition and Training Periodization in Three Elite Marathon Runners." International Journal of Sport Nutrition and Exercise Metabolism, 22(5), 392-400.
  2. Sports Medicine – Open (2024). "Quantitative Analysis of 92 12-Week Sub-elite Marathon Training Plans."
  3. Tanaka, H. (1994). "Effects of cross-training: Transfer of training effects on VO2max between cycling, running and swimming." Sports Medicine, 18(5), 330-339.
  4. Runner's World / Marathon Training Academy (2024-2025). "Marathon Training After 50." Research on masters athlete adaptations and recovery needs.
  5. Haugen, T., et al. (2019). "The Training and Development of Elite Sprint Performance: an Integration of Scientific and Best Practice Literature." Sports Medicine – Open.

Quick(ish) Price Check on a Car

So, is it a good price?

With my oldest daughter heading off to college soon, we’ve realized that our family car doesn’t need to be as large as it used to be. We’ve had a great relationship with our local CarMax over the years, and we appreciate their no-haggle pricing model. My wife had her eyes set on a particular model: a 2019 Volvo XC90 T6 Momentum. The specific car she found was listed at $35,998, with 47,000 miles on the odometer.

But is the price good or bad? As a hacker/data scientist, I knew I could get the data to make an informed decision, and doing the analysis at home is a great way to learn and use new technologies. The bottom line: the predicted price is $40,636, or 11.4% higher than the CarMax asking price. If I compare against the specific trim, the predicted price is $38,666. So the price is probably fair. Now, how did I come up with those numbers?

Calculations

Armed with Python and an array of web scraping tools, I embarked on a mission to collect data that would help me determine a fair value for our new car. I wrote a series of scripts to extract relevant information, such as price, year, and mileage, from various websites. This required a significant amount of Python work to convert the HTML data into a format that could be analyzed effectively.

Once I had amassed a decent dataset (close to 200 cars), I began comparing different statistical techniques to find the most accurate pricing model. In this blog post, I'll detail my journey through linear regression and compare it to more modern data science methods, revealing which technique ultimately led us to the fairest car price.

First, I did some basic web searching. According to Edmunds, the average price for a 2019 Volvo XC90 T6 Momentum with similar mileage is between $33,995 and $43,998, and my $35,998 falls within this range.

As for how the Momentum compares to other Volvo options and similar cars, there are a few things to consider. The Momentum is one of four trim levels available for the 2019 XC90. It comes with a number of standard features, including leather upholstery, a panoramic sunroof, and a 9-inch touchscreen infotainment system. Other trim levels offer additional features and options.

The 2019 Volvo XC90 comes in four trim levels: Momentum, R-Design, Inscription, and Excellence. The R-Design offers a sportier look and feel, while the Inscription adds more luxury features. The Excellence is the most luxurious and expensive option, with seating for four instead of seven. The Momentum is the most basic.

In terms of similar cars, some options to consider might include the Audi Q7 or the BMW X5. Both of these SUVs are similarly sized and priced to the XC90.

To get there, I had to do some web scraping and data cleaning, and I built a basic linear regression model as well as trying other modern data science methods. To begin my data collection journey, I decided (in 2 seconds) to focus on three primary sources: Google's search summary, Carvana, and Edmunds.

My first step was to search for Volvo XC90 on each of these websites. I then used the Google Chrome toolbar to inspect the webpage’s HTML structure and identify the <div> element containing the desired data. By clicking through the pages, I was able to copy the relevant HTML and put this in a text file, enclosed within <html> and <body> tags. This format made it easier for me to work with the BeautifulSoup Python library, which allowed me to extract the data I needed and convert it into CSV files.

Since the data from each source varied, I had to run several regular expressions on many fields to further refine the information I collected. This process ensured that the data was clean and consistent, making it suitable for my upcoming analysis.

Finally, I combined all the data from the three sources into a single CSV file. This master dataset provided a solid foundation for my pricing analysis and allowed me to compare various data science techniques in order to determine the most accurate and fair price for the 2019 Volvo XC90 T6 Momentum.

In the following sections, I’ll delve deeper into the data analysis process and discuss the different statistical methods I employed to make our car-buying decision.

First, the data from Carvana looked like this:

<div class="tk-pane full-width">
    <div class="inventory-type carvana-certified" data-qa="inventory-type">Carvana Certified
    </div>
    <div class="make-model" data-qa="make-model">
        <div class="year-make">2020 Volvo XC90</div>
    </div>
    <div class="trim-mileage" data-qa="trim-mileage"><span>T6 Momentum</span> • <span>36,614
            miles</span></div>
</div>
<div class="tk-pane middle-frame-pane">
    <div class="flex flex-col h-full justify-end" data-qa="pricing">
        <div data-qa="price" class="flex items-end font-bold mb-4 text-2xl">$44,990</div>
    </div>
</div>

To parse snippets like this, I used the BeautifulSoup library to extract the relevant data from the saved HTML file. The script (included at the end of this post) searches for the specific <div> elements containing the year, make, trim, mileage, and price details, cleans up the data by removing unnecessary whitespace and commas, and stores each listing in a dictionary. Finally, it compiles the dictionaries into a list and exports the data to a CSV file for further analysis.
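As a minimal illustration of that step (a sketch assuming the structure shown above, not my full script):

```python
from bs4 import BeautifulSoup

# A cut-down listing in the structure shown above
html = """
<div class="year-make">2020 Volvo XC90</div>
<div class="trim-mileage" data-qa="trim-mileage"><span>T6 Momentum</span> <span>36,614 miles</span></div>
<div data-qa="price">$44,990</div>
"""

soup = BeautifulSoup(html, "html.parser")
year, model = soup.find("div", class_="year-make").text.split(" ", 1)
trim, mileage = (s.text for s in soup.find("div", class_="trim-mileage").find_all("span"))
price = soup.find("div", attrs={"data-qa": "price"}).text

# Strip commas, units, and currency symbols before converting
record = {
    "Year": int(year),
    "Model": model,
    "Trim": trim,
    "Mileage": int(mileage.replace(",", "").replace("miles", "").strip()),
    "Price": int(price.replace("$", "").replace(",", "")),
}
print(record)
```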

I could then repeat this process with Google to get a variety of local sources.

One challenge with the Google results was that a lot of the data was base64-encoded images, so I wrote a bash script to clean up those tags using sed (pro tip: learn awk and sed).

When working with the Google search results, I had to take a slightly different approach than with Carvana and Edmunds. The Google results did not have a consistent HTML structure that could be easily parsed. Instead, I identified patterns within the text itself and used regular expressions to extract the specific pieces of information, such as the year, make, trim, mileage, and price, directly from the text. My scrape code is at the end of the post.
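For example, a pattern like the following pulls the fields out of flat text (the line format here is invented for illustration; the real listings varied more):

```python
import re

# Hypothetical flattened listing text from a search result
text = "2019 Volvo XC90 T6 Momentum - 47,000 mi - $35,998"

pattern = re.compile(
    r"(?P<year>\d{4}) Volvo XC90 (?P<trim>[\w\s-]+?) - "
    r"(?P<mileage>[\d,]+) mi - \$(?P<price>[\d,]+)"
)

m = pattern.search(text)
record = {
    "Year": int(m["year"]),
    "Trim": m["trim"],
    "Mileage": int(m["mileage"].replace(",", "")),
    "Price": int(m["price"].replace(",", "")),
}
print(record)  # {'Year': 2019, 'Trim': 'T6 Momentum', 'Mileage': 47000, 'Price': 35998}
```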

Scraping Edmunds required both approaches: parsing the text format and the HTML structure.

Altogether, I got 174 records of used Volvo XC90s. I could easily get 10x this, since the scripts exist and I could mine Craigslist and other sources. With the data in hand, I can use R to explore it:

# Load the readxl package
library(readxl)
library(scales)
library(scatterplot3d)

# Read the data from data.xlsx into a data frame
df <- read_excel("data.xlsx")

df$Price<-as.numeric(df$Price)/1000

# Select the columns you want to use
df <- df[, c("Title", "Desc", "Mileage", "Price", "Year", "Source")]

# Plot Year vs. Price with labeled axes and formatted y-axis
plot(df$Year, df$Price, xlab = "Year", ylab = "Price ($ '000)",
     yaxt = "n")  # Don't plot y-axis yet

# Add horizontal grid lines
grid()

# Format y-axis as currency
axis(side = 2, at = pretty(df$Price), labels = dollar(pretty(df$Price)))

abline(lm(Price ~ Year, data = df), col = "red")
Armed with this data, we can fit a linear regression model.

This code snippet employs the scatterplot3d() function to show a 3D scatter plot that displays the relationship between three variables in the dataset. Additionally, the lm() function is utilized to fit a linear regression model, which helps to identify trends and patterns within the data. To enhance the plot and provide a clearer representation of the fitted model, the plane3d() function is used to add a plane that represents the linear regression model within the 3D scatter plot.

model <- lm(Price ~ Year + Mileage, data = df)

# Plot the data and model
s3d <- scatterplot3d(df$Year, df$Mileage, df$Price,
                     xlab = "Year", ylab = "Mileage", zlab = "Price",
                     color = "blue")
s3d$plane3d(model, draw_polygon = TRUE)

So we can now predict the price of a 2019 Volvo XC90 T6 Momentum with about 47K miles: $40,636, or 11.4% higher than the CarMax asking price of $35,998.

# Create a new data frame with the values for the independent variables
new_data <- data.frame(Year = 2019, Mileage = 45000)

# Use the model to predict the price of a 2019 car with 45000 miles
predicted_price <- predict(model, new_data)

# Print the predicted price
print(predicted_price)

Other Methods

Ok, so now let's use "data science". Besides linear regression, there are several other techniques that can take into account the multiple variables (year, mileage, price) in the dataset. Here are some popular techniques:

Decision Trees: A decision tree is a tree-like model that uses a flowchart-like structure to make decisions based on the input features. It is a popular method for both classification and regression problems, and it can handle both categorical and numerical data.

Random Forest: Random forest is an ensemble learning technique that combines multiple decision trees to make predictions. It can handle both regression and classification problems and can handle missing data and noisy data.

Support Vector Machines (SVM): SVM is a powerful machine learning algorithm that can be used for both classification and regression problems. It works by finding the best hyperplane that separates the data into different classes or groups based on the input features.

Neural Networks: Neural networks are a class of machine learning algorithms that are inspired by the structure and function of the human brain. They are powerful models that can handle both numerical and categorical data and can be used for both regression and classification problems.

Gradient Boosting: Gradient boosting is a technique that combines multiple weak models to create a stronger one. It works by iteratively adding weak models to a strong model, with each model focusing on the errors made by the previous model.

All of these techniques can take multiple variables into account, and each has its strengths and weaknesses. The choice of technique depends on the specific nature of the problem and the data. It is often a good idea to try several techniques and compare their performance to see which one works best.

I’m going to use random forest and a decision tree model.

Random Forest

# Load the randomForest package
library(randomForest)

# "Title", "Desc", "Mileage", "Price", "Year", "Source"

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- sample(1:nrow(df), size = 0.7 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

# Fit a random forest model
model <- randomForest(Price ~ Year + Mileage, data = train_data, ntree = 500)

# Predict the prices for the test data
predictions <- predict(model, test_data)

# Calculate the mean squared error of the predictions
mse <- mean((test_data$Price - predictions)^2)

# Print the mean squared error
cat("Mean Squared Error:", mse)

The output from the random forest model indicates a mean squared error (MSE) of 17.15 and a variance explained of 88.61%. A lower MSE indicates a better fit to the data, while a higher variance explained means the model accounts for more of the variation in the target variable. An MSE of 17.15 is reasonably low, and explaining 88.61% of the variance suggests the model captures most of what drives price: both good signs.

However, the random forest method shows a predicted cost of $37,276.54.

I also tried cross-validation to get a better picture of the model's overall performance (MSE 33.89). Switching to a decision tree model pushed the MSE up to 50.91. Plain linear regression works just fine.

Adding the Trim

However, I was worried that I was comparing the Momentum against the higher trim options. So to get the trim, I tried the following prompt in GPT-4 to translate each listing's text into one of the four trims.

don't tell me the steps, just do it and show me the results.
given this list add, a column (via csv) that categorizes each one into only five categories Momentum, R-Design, Inscription, Excellence, or Unknown

That worked perfectly and we can see that we have mostly Momentums.
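A deterministic alternative to the GPT-4 step is plain string matching (a hypothetical sketch; `categorize_trim` is not from my actual pipeline):

```python
TRIMS = ("Momentum", "R-Design", "Inscription", "Excellence")

def categorize_trim(description: str) -> str:
    """Map a listing title or description to one of the four trims."""
    lowered = description.lower()
    for trim in TRIMS:
        if trim.lower() in lowered:
            return trim
    return "Unknown"

print(categorize_trim("2019 Volvo XC90 T6 Momentum"))    # Momentum
print(categorize_trim("2020 Volvo XC90 T6 AWD 7-Seat"))  # Unknown
```

The GPT-4 approach handles messier descriptions, but string matching is reproducible and free.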

| | Excellence | Inscription | Momentum | R-Design | Unknown |
|---|---|---|---|---|---|
| Count | 0 | 68 | 87 | 8 | 9 |
| Percent | 0.00% | 39.53% | 50.58% | 4.65% | 5.23% |

Frequency and count of cars by trim

And this probably invalidates my analysis as Inscriptions (in blue) do have clearly higher prices:

Plot of Price By Year

We can see the average prices (in thousands). In 2019, Inscriptions cost less than Momentums? That is probably a small \(n\) problem, since we only have 7 Inscriptions and 16 Momentums in our dataset for 2019.

| Year | R-Design | Inscription | Momentum |
|---|---|---|---|
| 2014 | $19.99 | NA | NA |
| 2016 | $30.59 | $32.59 | $28.60 |
| 2017 | $32.79 | $32.97 | $31.22 |
| 2018 | $37.99 | $40.69 | $33.23 |
| 2019 | NA | $36.79 | $39.09 |
| 2020 | NA | $47.94 | $43.16 |

Average prices by trim (in thousands of dollars)

So, if we restrict the dataset to Momentums only, what would the predicted price of the 2019 Momentum be? Just adding a filter and re-running the regression code above gives $38,666, which means we still have a good, reasonable price.

Quick Excursion

One last thing I'm interested in: does mileage or age matter more? Let's build a new model.

# Create Age variable
df$Age <- 2023 - df$Year

# Fit a linear regression model
model <- lm(Price ~ Mileage + Age, data = df)

# Print the coefficients
summary(model)$coef
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 61.34913 | 0.69084 | 88.80372 | 2.28e-144 |
| Mileage | -0.00022 | 2.44e-05 | -8.83869 | 1.18e-15 |
| Age | -2.75459 | 0.27132 | -10.15250 | 3.15e-19 |

Impact of different variables

Based on the regression results, both Age and Mileage have a significant effect on Price, as their p-values are far below 0.05. However, Age has a larger absolute t-score (-10.15) than Mileage (-8.84), indicating that Age may have a slightly greater effect on Price. The estimates show that each additional year of Age decreases Price by approximately 2.75 thousand dollars, while each additional mile decreases Price by approximately 0.00022 thousand dollars, or about 22 cents. That is actually pretty interesting.
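To put the two coefficients on the same footing, scale the mileage effect by a typical year of driving (a rough 12,000 miles-per-year assumption; the coefficients are from the summary above):

```python
# Coefficients from the regression summary, in thousands of dollars
age_coef = -2.75459      # per additional year of age
mileage_coef = -0.00022  # per additional mile

annual_from_age = age_coef * 1000                   # dollars per year of age
annual_from_mileage = mileage_coef * 12_000 * 1000  # dollars per 12,000 miles

print(round(annual_from_age), round(annual_from_mileage))  # -2755 -2640
```

A typical year of driving costs about as much in depreciation through mileage as the year itself does through age, which matches the near-equal t-scores.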

This isn’t that far off. According to the US government, a car depreciates by an average of $0.17 per mile driven. This is based on a five-year ownership period, during which time a car is expected to be driven approximately 12,000 miles per year, for a total of 60,000 miles.

In terms of depreciation per year, it can vary depending on factors such as make and model of the car, age, and condition. However, a general rule of thumb is that a car can lose anywhere from 15% to 25% of its value in the first year, and then between 5% and 15% per year after that. So on average, a car might depreciate by about 10% per year.

Code

While initially in the original blog post, I moved all the code to the end.

Carvana Scrape Code

Cleaner Code

Google Scrape Code

Edmunds Scrape Code
