Financial Data Project #1: Yield Curves Modeling and Correlation Structures in Financial Markets¶
In fixed income and equity markets, understanding how interest rates evolve across maturities and how asset returns move together is essential for risk management, pricing, and portfolio diversification. This project is part of the MSc in Financial Engineering program at WorldQuant University. It explores three fundamental quantitative finance problems:
- How can we model the term structure of interest rates?
- How are movements across maturities correlated?
- Can similar factor structures explain stock market co-movements?
Together, these form an end-to-end workflow connecting fixed-income theory with empirical equity analysis.
Methodology¶
Data sources:
- Vietnam HNX Treasury Bonds - Spot rate data (11 maturities, 3M-20Y, Oct 2025)
- Reserve Bank of Australia (RBA) - Daily yields (1Y-10Y) over six months
- State Street XLK ETF - Daily prices of 30 largest technology stocks (Apr - Oct 2025)
Techniques:
- Yield Curve Modelling: Nelson–Siegel (parametric) vs Cubic Spline (non-parametric)
- Factor Analysis: PCA and Scree plots on yield changes and stock returns
- Decomposition: Singular Value Decomposition (SVD) to verify PCA results
#Setup environment
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
!pip install -q ipykernel nbconvert scikit-learn openpyxl nelson_siegel_svensson
from nelson_siegel_svensson.calibrate import calibrate_ns_ols
from scipy.interpolate import CubicSpline
from sklearn.metrics import mean_squared_error
1. Yield Curve Modeling¶
We select the HNX government securities from our country, Vietnam, and use the official HNX data export (HNX, “Yield Curve”) in Excel format for a specific settlement date. The file contains maturities (in Vietnamese), spot rates (continuous), par yields, and spot rates (annual).
Then, we load the Excel file, extract the table, and create both:
- a string tenor column for plotting (e.g., “3M”, “1Y”) and
- a numeric maturity in years for modeling (e.g., 0.25, 1, 2, …).
We use Spot Rate Annual for modeling.
notebook_dir = os.getcwd()
file_path = os.path.join(notebook_dir, "Bond_market_data_01_10_2025.xlsx")
df = pd.read_excel(file_path, header=None, skiprows = 4, usecols = [0,2,4,6])
df.head(10)
| 0 | 2 | 4 | 6 | |
|---|---|---|---|---|
| 0 | Kỳ hạn còn lại | Spot rate liên tục (%) | Par Yield (%) | Spot rate theo năm (%) |
| 1 | 3 tháng | 2.737066 | NaN | 2.774868 |
| 2 | 6 tháng | 2.759724 | NaN | 2.798157 |
| 3 | 9 tháng | 2.782112 | NaN | 2.821175 |
| 4 | 1 năm | 2.804233 | 2.843921 | 2.843921 |
| 5 | 2 năm | 2.890047 | 2.930938 | 2.932214 |
| 6 | 3 năm | 2.971636 | 3.012858 | 3.01623 |
| 7 | 5 năm | 3.122353 | 3.161771 | 3.17161 |
| 8 | 7 năm | 3.256809 | 3.291521 | 3.310423 |
| 9 | 10 năm | 3.428918 | 3.452486 | 3.488383 |
The labels are in Vietnamese and the header row is stored as data, so we clean and transform the table before modelling:
#Convert tenor labels to English
def convert_vn_tenor_to_str(text):
    """
    Convert Vietnamese tenor labels (e.g. '3 tháng', '1 năm')
    into English-style labels ('3M', '1Y', etc.).
    """
    text = str(text).strip().lower()
    if "tháng" in text:
        number = int("".join(filter(str.isdigit, text)))
        return f"{number}M"
    elif "năm" in text:
        number = int("".join(filter(str.isdigit, text)))
        return f"{number}Y"
    else:
        return text
#Convert tenor to years for NS modelling
def tenor_to_years(s):
    s = str(s).upper().strip()
    if s.endswith("M"):
        return float(s[:-1]) / 12.0
    if s.endswith("Y"):
        return float(s[:-1])
    return np.nan
#Convert numeric text (comma or dot decimal separator) to float
def to_float(x):
    try:
        return float(str(x).replace(",", "."))
    except ValueError:
        return None
df.columns = ['Maturity', 'Spotrate_Continuous', 'ParYield', 'Spotrate_Annual']
df = df.drop(index=0).reset_index(drop=True) #Drop the Vietnamese header row stored as data
df["Maturity"] = df["Maturity"].apply(convert_vn_tenor_to_str) #Convert tenor names to English tenors
df["Spotrate_Annual"] = df["Spotrate_Annual"].map(to_float) #Make the modelling column numeric
#Drop unnecessary columns
df = df.drop(columns = ['ParYield', 'Spotrate_Continuous'])
df
| Maturity | Spotrate_Annual | |
|---|---|---|
| 0 | 3M | 2.774868 |
| 1 | 6M | 2.798157 |
| 2 | 9M | 2.821175 |
| 3 | 1Y | 2.843921 |
| 4 | 2Y | 2.932214 |
| 5 | 3Y | 3.01623 |
| 6 | 5Y | 3.17161 |
| 7 | 7Y | 3.310423 |
| 8 | 10Y | 3.488383 |
| 9 | 15Y | 3.707359 |
| 10 | 20Y | 3.834327 |
df.describe() #Exploratory Data Analysis
Exploratory Data Analysis:
The dataset contains 11 observations of annualised spot rates for Vietnam Treasury bonds across various maturities. The mean annual spot rate is approximately 3.15%, with a standard deviation of 0.38%, indicating moderate variation in yields across maturities.
The minimum rate is 2.77% and the maximum is 3.83%, an upward-sloping pattern in which longer maturities carry higher yields. The median rate (3.02%) sits slightly below the mean, which is expected given that four of the eleven tenors are at or below one year, and there are no extreme outliers.
Overall, the data reflects a typical, gradually increasing yield structure, consistent with a normal yield curve observed in stable economic conditions.
Now, let's plot the actual yield curve:
plt.figure(figsize = (9,5))
plt.plot(df['Maturity'], df['Spotrate_Annual'], marker = "o", linewidth = 1.8)
for i, (x, y) in enumerate(zip(df["Maturity"], df["Spotrate_Annual"])):
    #Place each label above its point, except the last one, which goes below
    if i == len(df) - 1: # last label
        plt.text(x, y - 0.1, f"{y:.2f}", ha="left", fontsize=11, color="black")
    else:
        plt.text(x, y + 0.06, f"{y:.2f}", ha="center", fontsize=11, color="black")
#Naming title and axes
plt.xlabel("Maturity (tenor)")
plt.ylabel("Yield [%]")
plt.title("Figure 1. Vietnam Treasury Bond (HNX) Yield Curve at 01-10-2025")
plt.tight_layout(pad=2)
plt.grid(True, alpha = 0.3)
plt.show()
Figure 1 shows a normal, upward-sloping yield curve:
- Short-term yields (3M-1Y) are relatively low, around 2.8%, reflecting lower compensation for short-term lending.
- Medium-term yields (2Y-7Y) gradually increase toward 3.3%, suggesting moderate expectations of future economic growth or inflation.
- Long-term yields (10Y-20Y) rise from about 3.5% to 3.8%, indicating that investors demand a higher return for locking up funds over extended periods.
Yield Curve Modeling using the Nelson-Siegel model¶
The Nelson-Siegel model is a popular method used to show how yields change with bond maturities:
$$y(t)=\beta_{0}+\beta_{1}\left( \frac{1-e^{-t/\tau}}{t/\tau} \right)+\beta_{2}\left( \frac{1-e^{-t/\tau}}{t/\tau}-e^{-t/\tau} \right)+\epsilon$$
where:
- $y(t)$: yield at maturity $t$
- $\beta_{0}$, $\beta_{1}$, $\beta_{2}$, and $\tau$: model parameters representing the level (long-term yield component), the slope (short-term yield component), the curvature (medium-term hump), and the decay scale (in years), respectively
- $\epsilon$: residual/error term
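To make the roles of the three factors concrete, the sketch below evaluates the loadings that multiply $\beta_{0}$, $\beta_{1}$, and $\beta_{2}$ at a few maturities. It assumes the $t/\tau$ parameterisation used in the fitting code of this notebook; the helper name `ns_loadings` and the value `tau=4.0` are illustrative choices, not part of the analysis.

```python
import numpy as np

def ns_loadings(maturity, tau):
    """Return the level, slope, and curvature loadings of the Nelson-Siegel curve."""
    x = np.asarray(maturity, dtype=float) / tau
    decay = np.exp(-x)
    slope = (1.0 - decay) / x      # tends to 1 as t -> 0 and to 0 as t -> infinity
    curvature = slope - decay      # tends to 0 at both ends, positive in between
    level = np.ones_like(x)        # constant across maturities
    return level, slope, curvature

maturities = np.array([0.25, 1.0, 5.0, 10.0, 20.0])
# tau=4.0 is an arbitrary illustrative value, not the fitted one
level, slope, curvature = ns_loadings(maturities, tau=4.0)
for t, l, s, c in zip(maturities, level, slope, curvature):
    print(f"t={t:5.2f}Y  level={l:.3f}  slope={s:.3f}  curvature={c:.3f}")
```

The slope loading falls monotonically with maturity, which is why $\beta_{1}$ mainly moves the short end, while the curvature loading is small at both ends and largest at intermediate maturities.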
In this project, the Nelson–Siegel model was chosen because its parameters have a direct economic reading: long-run rate expectations (level), the short-term policy stance (slope), and medium-term market sentiment (curvature). While splines or high-order regressions can interpolate the observed points exactly, they lack this economic interpretation and tend to extrapolate less reliably beyond the observed maturities.
As a result, this model not only reproduces the observed upward-sloping curve with high accuracy but also enables a deeper understanding of how interest rate factors evolve across maturities.
from scipy.optimize import least_squares
df = df.dropna(subset=["Maturity", "Spotrate_Annual"])
df["Maturity_Years"] = df["Maturity"].map(tenor_to_years)
df = df.dropna(subset=["Maturity_Years"])
x = df["Maturity_Years"].astype(float).to_numpy()
y = df["Spotrate_Annual"].astype(float).to_numpy()
# Nelson-Siegel function
def nelson_siegel(maturity, beta0, beta1, beta2, tau):
    t = maturity / tau
    e = np.exp(-t)
    term1 = (1 - e) / t
    return beta0 + beta1 * term1 + beta2 * (term1 - e)
# Residuals
def residuals(params):
    return nelson_siegel(x, *params) - y
initial_guess = [3.0, -1.0, 0.5, 1.0]
bounds = ([0, -10, -10, 0.1], [10, 10, 10, 10])
# Fit
result = least_squares(residuals, x0=initial_guess, bounds=bounds)
beta0, beta1, beta2, tau = result.x
print(f"β0={beta0:.4f}, β1={beta1:.4f}, β2={beta2:.4f}, τ={tau:.4f}")
β0=4.3350, β1=-1.5796, β2=-0.8619, τ=4.1588
Using this model, we estimate the following parameters: $\beta_{0} = 4.3350$, $\beta_{1} = -1.5796$, $\beta_{2} = -0.8619$ and $\tau = 4.1588$.
Now we evaluate the fitted model on a dense maturity grid and plot it against the observed yields:
x_fit = np.linspace(min(x), max(x), 200)
y_fit = nelson_siegel(x_fit, beta0, beta1, beta2, tau)
plt.figure(figsize=(9,5))
plt.scatter(x, y, color="red", label="Actual Yields") # yields are already in percent
plt.plot(x_fit, y_fit, color="blue", label="Nelson-Siegel Fit")
plt.xlabel("Maturity (Years)")
plt.ylabel("Yield [%]")
plt.title("Figure 2. Vietnam Treasury Bond Yield Curve at 01-10-2025 - Nelson-Siegel Fit")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
The Nelson-Siegel model successfully estimated the parameters governing the shape of the Vietnam Treasury bond yield curve as of 01-10-2025 with:
$\beta_{0} = 4.3350$, $\beta_{1} = -1.5796$, $\beta_{2} = -0.8619$ and $\tau = 4.1588$.
- $\beta_{0}$ (Level Factor) represents the long-term yield level toward which the curve converges as maturity increases. Here, the long-run interest rate stabilises around 4.33%, indicating a moderate expected rate environment.
- $\beta_{1}$ (Slope Factor) determines the short-term steepness of the yield curve. A negative slope implies that short-term yields are lower than long-term yields, forming an upward-sloping curve; this is consistent with normal market conditions where investors demand higher yields for longer maturities.
- $\beta_{2}$ (Curvature Factor) controls the hump or trough at medium-term maturities. The negative value pulls mid-term yields slightly below what the level and slope factors alone would imply, so the curve rises smoothly without a hump.
- $\tau$ (Decay Scale) sets how quickly the slope and curvature effects fade as maturity increases. A $\tau$ of about 4.16 years means these effects decay gradually, so the curve keeps rising at a diminishing rate over the observed maturities rather than flattening abruptly.
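The interpretation above can be checked with a short calculation on the reported estimates. This is a minimal sketch: the variable names `short_end`, `long_end`, and `hump_maturity` are illustrative, and the constant 1.7933 is the approximate solution of $e^{x} = x^{2} + x + 1$, the point where the curvature loading reaches its maximum.

```python
import numpy as np

# Parameter estimates as printed in the notebook output above
beta0, beta1, beta2, tau = 4.3350, -1.5796, -0.8619, 4.1588

def nelson_siegel(maturity, beta0, beta1, beta2, tau):
    x = np.asarray(maturity, dtype=float) / tau
    decay = np.exp(-x)
    slope = (1.0 - decay) / x
    return beta0 + beta1 * slope + beta2 * (slope - decay)

# As t -> 0 the slope loading tends to 1 and the curvature loading to 0,
# so the fitted curve starts at beta0 + beta1.
short_end = beta0 + beta1
# As t -> infinity both loadings vanish, so the curve approaches beta0.
long_end = beta0
# The curvature loading peaks where exp(x) = x**2 + x + 1, i.e. x ~ 1.7933.
hump_maturity = 1.7933 * tau

print(f"short-end limit: {short_end:.4f}%")
print(f"long-end limit:  {long_end:.4f}%")
print(f"curvature loading peaks near {hump_maturity:.1f} years")
print(f"fitted 20Y yield: {float(nelson_siegel(20.0, beta0, beta1, beta2, tau)):.4f}%")
```

The short-end limit of about 2.76% sits just below the observed 3M yield, and the fitted 20Y yield remains well below $\beta_{0}$, which shows that the long-run level is an asymptote rather than a yield observed in the data.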
Cubic Spline Fitting of Yield Curve¶
For comparison, we fit a Cubic Spline model (k = 3) to interpolate the yield curve across maturities. The maturities (0.25-20 years) are treated as the knot points, and the spline estimates a smooth function y(t)=f(t) that passes through all observed yields.
The fitted spline (red line) produces a smooth and continuous curve connecting the discrete yield observations. This model captures the overall upward-sloping pattern of the Vietnamese yield curve at 01-10-2025, while allowing for minor local curvature between maturities.
from scipy.interpolate import make_interp_spline
x = np.array([0.25, 0.5, 0.75, 1, 2, 3, 5, 7, 10, 15, 20]) # Maturities
y = df['Spotrate_Annual'].values # Yields
# Fit a cubic spline
spline = make_interp_spline(x, y, k=3)
x_smooth = np.linspace(x.min(), x.max(), 200)
y_smooth = spline(x_smooth)
#Plot the fitted curve
plt.figure(figsize=(8,5))
plt.plot(x, y, 'o', label="Observed Yields")
plt.plot(x_smooth, y_smooth, '-', label="Cubic Spline Fit", color='red')
plt.xlabel("Maturity (Years)")
plt.ylabel("Yield [%]")
plt.title("Figure 3. Vietnam Treasury Yield Curve at 01-10-2025 - Cubic Spline Fit")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
#Extract coefficients
coefficients = spline.c
knots = spline.t
print(knots)
print(coefficients)
[ 0.25 0.25 0.25 0.25 0.75 1. 2. 3. 5. 7. 10. 20. 20. 20. 20. ]
[2.77486769 2.79048449 2.81363746 2.86671314 2.93292727 3.04493812 3.1743708 3.33624889 3.63250271 3.7790624 3.83432704]
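For the Cubic Spline model, the curve is defined by local cubic polynomial segments joined at the knots. The knot vector has 15 entries: the boundary maturities 0.25 and 20 years, each repeated four times, plus seven interior knots (0.75, 1, 2, 3, 5, 7, and 10 years). The default not-a-knot boundary condition leaves 0.5Y and 15Y out of the knot vector. The 11 coefficients ($\alpha_{1}$ to $\alpha_{11}$) are weights on the B-spline basis functions rather than fitted yields at individual maturities. Only the first and last coincide with observed yields: $\alpha_{1}$ = 2.7749 (0.25Y) and $\alpha_{11}$ = 3.8343 (20Y). Interior coefficients such as $\alpha_{9}$ = 3.6325 do not correspond to any single observed yield.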
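The difference between spline coefficients and fitted yields can be verified directly. The sketch below refits the same interpolating spline on the cleaned HNX data and compares the observed yields with both the spline evaluated at the data points and the raw coefficient vector; the variable names `max_gap_at_data` and `max_gap_coef` are illustrative.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Maturities (years) and annual spot rates (%) from the cleaned HNX table above
x = np.array([0.25, 0.5, 0.75, 1, 2, 3, 5, 7, 10, 15, 20])
y = np.array([2.774868, 2.798157, 2.821175, 2.843921, 2.932214, 3.01623,
              3.17161, 3.310423, 3.488383, 3.707359, 3.834327])

spline = make_interp_spline(x, y, k=3)

# The spline reproduces every observed yield when evaluated at the data points...
max_gap_at_data = np.max(np.abs(spline(x) - y))
# ...but its B-spline coefficients generally differ from those yields.
max_gap_coef = np.max(np.abs(spline.c - y))

print(f"knots: {len(spline.t)}, coefficients: {len(spline.c)}")
print(f"max |spline(x) - y|: {max_gap_at_data:.2e}")
print(f"max |c - y|:         {max_gap_coef:.4f}")
```

To read fitted yields off the spline, evaluate it with `spline(maturity)` rather than inspecting `spline.c`.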
Comparison between Cubic Spline and NS¶
from sklearn.metrics import mean_squared_error
rmse_ns = np.sqrt(mean_squared_error(y, nelson_siegel(x, beta0, beta1, beta2, tau)))
rmse_spline = np.sqrt(mean_squared_error(y, spline(x)))
print(f"NS RMSE: {rmse_ns:.6f}, Spline RMSE: {rmse_spline:.6f}")
NS RMSE: 0.003136, Spline RMSE: 0.000000
- Fit comparison:
The Cubic Spline model achieves an RMSE of zero to six decimal places, meaning it fits every observed yield exactly. This is by construction: the spline is an interpolating function that passes through every data point, so in-sample residuals vanish up to floating-point error.
In contrast, the Nelson-Siegel model has a slightly higher RMSE (0.0031 percentage points, roughly 0.3 basis points), which is expected. Since it is a parametric model with only four parameters, it smooths the data according to its functional form rather than interpolating each observation. This allows it to capture the general trend and curvature of the yield curve without overfitting small fluctuations.
- Interpretation:
The Nelson-Siegel model provides a smooth and interpretable fit to the yield curve, where each parameter $\beta_{0}, \beta_{1}, \beta_{2}$ and $\tau$ represents an economic aspect such as long-term level, short-term slope, and medium-term curvature. In contrast, the Cubic Spline model offers a near-perfect mathematical fit by interpolating through all data points, resulting in a lower RMSE but limited economic interpretability. Thus, while the spline excels in precision, the Nelson-Siegel model is preferred for understanding and analysing the underlying term structure of interest rates.
2. Exploiting Correlation¶
In this section, we aim to understand how yields move together across different maturities, which is crucial for both fixed-income investors and policymakers. Bond yields are not independent: a change in monetary policy or the macroeconomic outlook typically shifts several maturities at once.
By studying the correlation structure of these movements, we can uncover the latent factors that drive most of the variation in interest rates.
This analysis focuses on daily yield changes for Australian government bonds with maturities from 1 to 10 years.
Instead of treating each maturity as a separate variable, we apply Principal Component Analysis (PCA) to identify the dominant underlying forces influencing yield dynamics.
PCA helps us reduce a complex, multi-dimensional dataset into a few orthogonal factors, typically interpreted as:
- Level: parallel shifts of the whole curve,
- Slope: steepening or flattening of short vs. long maturities,
- Curvature: twists around the middle segment of the curve.
To assess how many of these factors are meaningful, we use a Scree Plot, which displays the proportion of variance explained by each principal component.
We used data from the Reserve Bank of Australia (RBA, F2: Australian Government Securities Yields). Five maturities were selected to represent the Australian government securities yield curve - 1-year, 2-year, 3-year, 5-year, and 10-year bonds, covering a six-month period.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
#online link
#rba = pd.read_csv("https://www.rba.gov.au/statistics/tables/csv/f2-data.csv", skiprows=10, encoding='ISO-8859-1')
notebook_dir = os.getcwd()
file_path = os.path.join(notebook_dir, "f2-data.csv")
rba = pd.read_csv(file_path, skiprows=10, encoding='ISO-8859-1', low_memory=False)
cols = ["Series ID", "FCMYGBAGID", "FCMYGBAG2D", "FCMYGBAG3D", "FCMYGBAG5D", "FCMYGBAG10D"]
rba = rba[cols].dropna()
rba.columns = ["Date", "1Y", "2Y", "3Y", "5Y", "10Y"]
rba["Date"] = pd.to_datetime(rba["Date"])
rba = rba.set_index("Date")
yields = rba[rba.index >= "2025-04-01"]
head_rows = yields.head(5)
tail_rows = yields.tail(5)
display(pd.concat([head_rows, pd.DataFrame([["..."] * yields.shape[1]], columns=yields.columns), tail_rows]))
| 1Y | 2Y | 3Y | 5Y | 10Y | |
|---|---|---|---|---|---|
| 2025-04-01 00:00:00 | 2.22 | 3.665 | 3.69 | 3.842 | 4.391 |
| 2025-04-02 00:00:00 | 2.21 | 3.689 | 3.713 | 3.861 | 4.397 |
| 2025-04-03 00:00:00 | 2.061 | 3.527 | 3.549 | 3.697 | 4.242 |
| 2025-04-04 00:00:00 | 2.078 | 3.394 | 3.417 | 3.593 | 4.199 |
| 2025-04-07 00:00:00 | 2.092 | 3.252 | 3.272 | 3.47 | 4.081 |
| 0 | ... | ... | ... | ... | ... |
| 2025-09-25 00:00:00 | 2.049 | 3.491 | 3.54 | 3.738 | 4.356 |
| 2025-09-26 00:00:00 | 2.049 | 3.521 | 3.577 | 3.773 | 4.395 |
| 2025-09-29 00:00:00 | 2.004 | 3.475 | 3.529 | 3.724 | 4.342 |
| 2025-09-30 00:00:00 | 1.97 | 3.487 | 3.538 | 3.722 | 4.307 |
| 2025-10-01 00:00:00 | 1.996 | 3.519 | 3.569 | 3.756 | 4.357 |
To analyse movements in the yield curve, we compute the daily yield changes instead of using raw yield levels.
This is done by taking the first difference of yields across consecutive dates using the .diff() command.
#Compute daily yield changes
yield_changes = yields.diff(axis = 0).dropna()
#Preview
yield_changes
| 1Y | 2Y | 3Y | 5Y | 10Y | |
|---|---|---|---|---|---|
| Date | |||||
| 2025-04-02 | -0.010 | 0.024 | 0.023 | 0.019 | 0.006 |
| 2025-04-03 | -0.149 | -0.162 | -0.164 | -0.164 | -0.155 |
| 2025-04-04 | 0.017 | -0.133 | -0.132 | -0.104 | -0.043 |
| 2025-04-07 | 0.014 | -0.142 | -0.145 | -0.123 | -0.118 |
| 2025-04-08 | 0.140 | 0.060 | 0.065 | 0.085 | 0.141 |
| ... | ... | ... | ... | ... | ... |
| 2025-09-25 | 0.038 | 0.043 | 0.039 | 0.043 | 0.059 |
| 2025-09-26 | 0.000 | 0.030 | 0.037 | 0.035 | 0.039 |
| 2025-09-29 | -0.045 | -0.046 | -0.048 | -0.049 | -0.053 |
| 2025-09-30 | -0.034 | 0.012 | 0.009 | -0.002 | -0.035 |
| 2025-10-01 | 0.026 | 0.032 | 0.031 | 0.034 | 0.050 |
127 rows × 5 columns
Then, the PCA was employed using the covariance matrix, as all yield changes are measured on the same scale, making standardisation unnecessary.
#Initiate the PCA model and fit the yield_changes
pca = PCA()
pca.fit(yield_changes)
#Display the variance contribution
explained_var = pca.explained_variance_ratio_
print("Explained Variance Ratio for each component:")
for i, var in enumerate(explained_var, start=1):
    print(f"Component {i}: {var:.4f} ({var*100:.2f}%)")
Explained Variance Ratio for each component:
Component 1: 0.8299 (82.99%)
Component 2: 0.1557 (15.57%)
Component 3: 0.0130 (1.30%)
Component 4: 0.0011 (0.11%)
Component 5: 0.0003 (0.03%)
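As a cross-check on the covariance-based PCA, the explained variances can be obtained in two equivalent ways: as eigenvalues of the sample covariance matrix, or from the singular values of the centred data matrix. The sketch below uses only the ten yield-change rows displayed in the preview above, so its ratios will differ from the full 127-row results; it is meant to illustrate the equivalence, not to reproduce the figures.

```python
import numpy as np

# Ten of the daily yield changes shown in the preview above (columns: 1Y, 2Y, 3Y, 5Y, 10Y)
sample = np.array([
    [-0.010,  0.024,  0.023,  0.019,  0.006],
    [-0.149, -0.162, -0.164, -0.164, -0.155],
    [ 0.017, -0.133, -0.132, -0.104, -0.043],
    [ 0.014, -0.142, -0.145, -0.123, -0.118],
    [ 0.140,  0.060,  0.065,  0.085,  0.141],
    [ 0.038,  0.043,  0.039,  0.043,  0.059],
    [ 0.000,  0.030,  0.037,  0.035,  0.039],
    [-0.045, -0.046, -0.048, -0.049, -0.053],
    [-0.034,  0.012,  0.009, -0.002, -0.035],
    [ 0.026,  0.032,  0.031,  0.034,  0.050],
])

centred = sample - sample.mean(axis=0)   # centre each column; no rescaling
n_obs = centred.shape[0]

# Route 1: eigenvalues of the sample covariance matrix
cov = np.cov(centred, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order

# Route 2: squared singular values of the centred data, divided by n - 1
singular_values = np.linalg.svd(centred, compute_uv=False)
var_from_svd = singular_values**2 / (n_obs - 1)

explained_ratio = eigvals / eigvals.sum()
print(np.round(explained_ratio, 4))
print("Routes agree:", np.allclose(eigvals, var_from_svd))
```

Because the data are centred but not standardised, both routes correspond to PCA on the covariance matrix rather than on the correlation matrix.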
In the full-sample results, the first principal component (PC1) explains approximately 83% of the total variance and is conventionally interpreted as a parallel shift (level movement) of the entire yield curve. PC2 (around 15.6%) is associated with slope changes, i.e. the steepening or flattening of the curve. PC3 (about 1.3%) reflects curvature or twisting effects, while the remaining components contribute minimal explanatory power.
This pattern confirms that most yield movements are driven by a single dominant factor, consistent with the theoretical structure of government bond markets. To better visualize this variance contribution, let's look at the scree plot:
#Plotting the Scree Plot
plt.figure(figsize=(6,4))
plt.plot(range(1, len(explained_var)+1), explained_var, 'o-', linewidth=2)
plt.title("Scree Plot - Australian Government Bonds")
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")
plt.grid(True)
plt.show()
The scree plot drops off sharply after the first two components: PC1 alone explains about 83% of the total variance, and PC1 and PC2 together explain about 98.6%. This is consistent with the three-factor view of government bond markets, in which the level component dominates and slope and curvature play smaller secondary roles.
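Explained variance alone does not show that a component is a level or slope factor; that reading comes from the loadings. The sketch below illustrates the idea on SYNTHETIC yield changes (not RBA data), built from a known parallel shock and a smaller tilt. All shapes, volatilities, and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
tenors = ["1Y", "2Y", "3Y", "5Y", "10Y"]

# SYNTHETIC yield changes (not RBA data): a parallel shock plus a smaller tilt plus noise
level_shape = np.ones(5)
slope_shape = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
n_days = 500
changes = (
    rng.normal(0.0, 0.08, size=(n_days, 1)) * level_shape
    + rng.normal(0.0, 0.03, size=(n_days, 1)) * slope_shape
    + rng.normal(0.0, 0.005, size=(n_days, 5))
)

centred = changes - changes.mean(axis=0)
# Rows of vt are the principal axes (loadings), ordered by explained variance
_, _, vt = np.linalg.svd(centred, full_matrices=False)

for k, name in enumerate(["PC1", "PC2", "PC3"]):
    row = "  ".join(f"{tenor}: {w:+.2f}" for tenor, w in zip(tenors, vt[k]))
    print(f"{name}  {row}")
```

A level factor shows loadings of the same sign across all tenors, while a slope factor shows opposite signs at the short and long ends. For the fitted scikit-learn model, the analogous array is `pca.components_`, and the sign of each row is arbitrary.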
3. Empirical Analysis of ETFs¶
After exploring interest rate dynamics in the bond market, we extend the analysis to the equity domain by examining the correlation structure of assets within the State Street XLK Technology ETF.
This ETF tracks the performance of major U.S. technology companies such as Apple, Microsoft, and NVIDIA, providing a representative sample of the broader tech sector. The 30 largest holdings’ tickers were collected from State Street Global Advisors as of 1 October 2025.
Equity returns, like yields, tend to move together due to systematic market forces, such as macroeconomic conditions, sector-wide shocks, or investor sentiment. However, not all stocks react equally: some are more sensitive to common risk factors, while others are driven by firm-specific characteristics.
To uncover these underlying relationships, we apply PCA to the daily return matrix of the ETF’s top constituents. This lets us quantify the share of total variation in stock returns explained by systematic factors, identify the dominant drivers of co-movement across assets, and differentiate between broad market effects and idiosyncratic noise.
Through a Scree Plot, we can assess how many meaningful components capture most of the ETF’s return behavior. Typically, a strong first component indicates a single market-wide factor dominating the entire sector, while smaller components reflect more specific sub-sector dynamics or company-level effects. In this way, dimensionality reduction and factor extraction reveal the core forces driving asset co-movement, applying to equities the same tools used for the bond market above.
#Import data
notebook_dir = os.getcwd()
file_path = os.path.join(notebook_dir, "holdings-daily-us-en-xlk.xlsx")
df = pd.read_excel(file_path, skiprows=4)
print("Columns in file:", df.columns.tolist())
df_top30 = df.sort_values(by='Weight', ascending=False).head(30).reset_index(drop=True)
print("Top 30 Holdings of XLK:")
display(df_top30[['Ticker', 'Name', 'Weight']])
tickers = df_top30['Ticker'].tolist()
print(tickers)
Columns in file: ['Name', 'Ticker', 'Identifier', 'SEDOL', 'Weight', 'Sector', 'Shares Held', 'Local Currency'] Top 30 Holdings of XLK:
| Ticker | Name | Weight | |
|---|---|---|---|
| 0 | NVDA | NVIDIA CORP | 14.824341 |
| 1 | MSFT | MICROSOFT CORP | 12.330501 |
| 2 | AAPL | APPLE INC | 12.273676 |
| 3 | AVGO | BROADCOM INC | 5.116163 |
| 4 | PLTR | PALANTIR TECHNOLOGIES INC A | 3.819037 |
| 5 | ORCL | ORACLE CORP | 3.720152 |
| 6 | AMD | ADVANCED MICRO DEVICES | 2.472807 |
| 7 | CSCO | CISCO SYSTEMS INC | 2.413904 |
| 8 | IBM | INTL BUSINESS MACHINES CORP | 2.397773 |
| 9 | CRM | SALESFORCE INC | 2.050192 |
| 10 | MU | MICRON TECHNOLOGY INC | 1.844963 |
| 11 | INTU | INTUIT INC | 1.707569 |
| 12 | NOW | SERVICENOW INC | 1.699435 |
| 13 | LRCX | LAM RESEARCH CORP | 1.670124 |
| 14 | APP | APPLOVIN CORP CLASS A | 1.661486 |
| 15 | QCOM | QUALCOMM INC | 1.634886 |
| 16 | AMAT | APPLIED MATERIALS INC | 1.610849 |
| 17 | TXN | TEXAS INSTRUMENTS INC | 1.488057 |
| 18 | INTC | INTEL CORP | 1.465687 |
| 19 | ACN | ACCENTURE PLC CL A | 1.366281 |
| 20 | APH | AMPHENOL CORP CL A | 1.354537 |
| 21 | KLAC | KLA CORP | 1.349615 |
| 22 | ADBE | ADOBE INC | 1.338528 |
| 23 | ANET | ARISTA NETWORKS INC | 1.336609 |
| 24 | PANW | PALO ALTO NETWORKS INC | 1.255049 |
| 25 | CRWD | CROWDSTRIKE HOLDINGS INC A | 1.111665 |
| 26 | ADI | ANALOG DEVICES INC | 1.076651 |
| 27 | CDNS | CADENCE DESIGN SYS INC | 0.849445 |
| 28 | SNPS | SYNOPSYS INC | 0.782688 |
| 29 | MSI | MOTOROLA SOLUTIONS INC | 0.670122 |
['NVDA', 'MSFT', 'AAPL', 'AVGO', 'PLTR', 'ORCL', 'AMD', 'CSCO', 'IBM', 'CRM', 'MU', 'INTU', 'NOW', 'LRCX', 'APP', 'QCOM', 'AMAT', 'TXN', 'INTC', 'ACN', 'APH', 'KLAC', 'ADBE', 'ANET', 'PANW', 'CRWD', 'ADI', 'CDNS', 'SNPS', 'MSI']
Daily closing prices (adjusted via auto_adjust=True) for the 30 constituent stocks identified above were collected from Yahoo Finance using the yfinance Python library, covering the six-month period from April to October 2025.
#Import relevant packages
import pandas as pd
import yfinance as yf
from datetime import datetime
#Specify the date range
end_date = "2025-10-01"
start_date = "2025-04-01"
#Download the data
data = yf.download(tickers, start=start_date, end=end_date, interval='1d', auto_adjust=True)['Close']
data = data.dropna(how='all')
print("6-month close prices (last 5 days):")
display(data.tail())
[*********************100%***********************] 30 of 30 completed
6-month close prices (last 5 days):
| Ticker | AAPL | ACN | ADBE | ADI | AMAT | AMD | ANET | APH | APP | AVGO | ... | MSI | MU | NOW | NVDA | ORCL | PANW | PLTR | QCOM | SNPS | TXN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 2025-09-24 | 252.309998 | 239.080002 | 353.269989 | 248.610001 | 201.440002 | 160.880005 | 142.639999 | 123.129997 | 641.919983 | 339.309998 | ... | 455.130005 | 161.608795 | 933.369995 | 176.970001 | 307.925629 | 200.699997 | 179.559998 | 173.550003 | 468.089996 | 182.808289 |
| 2025-09-25 | 256.869995 | 232.559998 | 354.160004 | 247.529999 | 199.600006 | 161.270004 | 143.059998 | 122.330002 | 639.909973 | 336.100006 | ... | 455.730011 | 156.731857 | 918.609985 | 177.690002 | 290.825287 | 202.210007 | 179.119995 | 169.679993 | 487.200012 | 180.429520 |
| 2025-09-26 | 255.460007 | 238.970001 | 360.369995 | 247.559998 | 203.919998 | 159.460007 | 142.500000 | 122.599998 | 669.859985 | 334.529999 | ... | 456.519989 | 157.171570 | 936.000000 | 178.190002 | 282.968933 | 202.369995 | 177.570007 | 169.199997 | 487.760010 | 182.917328 |
| 2025-09-29 | 254.429993 | 247.000000 | 359.420013 | 244.789993 | 204.949997 | 161.360001 | 143.369995 | 121.010002 | 712.359985 | 327.899994 | ... | 454.179993 | 163.797424 | 940.849976 | 181.850006 | 282.270172 | 203.960007 | 178.860001 | 165.300003 | 481.609985 | 181.608994 |
| 2025-09-30 | 254.630005 | 246.600006 | 352.750000 | 245.699997 | 204.740005 | 161.789993 | 145.710007 | 123.750000 | 718.539978 | 329.910004 | ... | 457.290009 | 167.215286 | 920.280029 | 186.580002 | 280.752777 | 203.619995 | 182.419998 | 166.360001 | 493.390015 | 182.104568 |
5 rows × 30 columns
#EDA on the raw data
summary_stats = data.describe().T.round(2)
print("Summary Statistics of 30 XLK Holdings (6-Month Close Prices):")
display(summary_stats)
Summary Statistics of 30 XLK Holdings (6-Month Close Prices):
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Ticker | ||||||||
| AAPL | 126.0 | 213.89 | 17.29 | 172.00 | 201.15 | 209.85 | 227.61 | 256.87 |
| ACN | 126.0 | 281.41 | 27.31 | 232.56 | 255.38 | 283.81 | 304.78 | 321.60 |
| ADBE | 126.0 | 371.83 | 23.33 | 333.65 | 352.88 | 367.12 | 384.71 | 420.68 |
| ADI | 126.0 | 223.54 | 23.42 | 163.21 | 213.67 | 229.14 | 244.05 | 254.62 |
| AMAT | 126.0 | 169.63 | 18.70 | 126.23 | 157.19 | 168.39 | 184.95 | 204.95 |
| AMD | 126.0 | 135.54 | 29.81 | 78.21 | 111.06 | 138.47 | 161.34 | 184.42 |
| ANET | 126.0 | 107.98 | 24.93 | 64.37 | 90.48 | 101.53 | 135.47 | 153.04 |
| APH | 126.0 | 95.80 | 17.26 | 58.90 | 85.38 | 97.80 | 109.31 | 125.40 |
| APP | 126.0 | 393.76 | 111.76 | 219.37 | 336.17 | 365.33 | 438.64 | 718.54 |
| AVGO | 126.0 | 262.04 | 55.16 | 145.70 | 229.00 | 270.52 | 299.23 | 368.94 |
| CDNS | 126.0 | 317.82 | 33.91 | 231.64 | 297.58 | 318.71 | 348.99 | 373.37 |
| CRM | 126.0 | 259.04 | 14.05 | 231.26 | 246.59 | 259.80 | 268.85 | 290.18 |
| CRWD | 126.0 | 444.34 | 41.59 | 321.63 | 423.56 | 444.21 | 476.27 | 514.10 |
| CSCO | 126.0 | 64.28 | 4.67 | 52.55 | 62.29 | 66.56 | 67.74 | 71.36 |
| IBM | 126.0 | 258.15 | 18.99 | 218.09 | 242.27 | 256.92 | 278.15 | 292.80 |
| INTC | 126.0 | 22.50 | 3.20 | 18.13 | 20.32 | 21.73 | 23.64 | 35.50 |
| INTU | 126.0 | 697.32 | 67.47 | 541.40 | 654.18 | 697.43 | 761.52 | 805.92 |
| KLAC | 126.0 | 842.31 | 119.53 | 573.90 | 762.05 | 876.60 | 917.60 | 1078.60 |
| LRCX | 126.0 | 92.55 | 17.31 | 58.83 | 82.13 | 96.55 | 100.56 | 133.90 |
| MSFT | 126.0 | 472.03 | 49.66 | 353.33 | 451.48 | 495.47 | 509.17 | 534.76 |
| MSI | 126.0 | 433.41 | 24.52 | 392.79 | 415.35 | 423.68 | 456.32 | 489.19 |
| MU | 126.0 | 110.84 | 25.67 | 64.62 | 94.80 | 114.14 | 123.05 | 168.78 |
| NOW | 126.0 | 938.54 | 77.47 | 721.65 | 900.20 | 952.69 | 1005.07 | 1044.69 |
| NVDA | 126.0 | 150.45 | 27.87 | 94.30 | 132.04 | 157.86 | 176.09 | 186.58 |
| ORCL | 126.0 | 208.23 | 54.68 | 122.35 | 157.22 | 220.37 | 244.40 | 327.76 |
| PANW | 126.0 | 188.61 | 13.03 | 152.44 | 181.33 | 192.04 | 198.28 | 208.19 |
| PLTR | 126.0 | 140.03 | 28.18 | 74.01 | 123.33 | 140.69 | 158.29 | 186.97 |
| QCOM | 126.0 | 151.98 | 9.63 | 123.21 | 145.91 | 153.42 | 158.34 | 173.55 |
| SNPS | 126.0 | 516.39 | 73.19 | 380.90 | 466.18 | 501.10 | 595.16 | 645.35 |
| TXN | 126.0 | 183.93 | 19.09 | 142.07 | 177.45 | 184.28 | 196.50 | 217.72 |
Daily returns¶
The daily return of stock i on day t is computed as:
$$ r_{i,t} = \frac{P_{i,t} - P_{i,t-1}}{P_{i,t-1}} $$
where:
- $r_{i,t}$ : daily return of stock i at time t
- $P_{i,t}$ : closing price of stock i on day t
- $P_{i,t-1}$ : closing price of stock i on the previous day
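As a worked instance of the formula, the sketch below applies it to the last five AAPL closes from the price table above. Rounded to four decimals, the results match the AAPL entries for 2025-09-25 to 2025-09-30 in the returns table.

```python
import numpy as np

# AAPL closes for 2025-09-24 to 2025-09-30, copied from the price table above
prices = np.array([252.309998, 256.869995, 255.460007, 254.429993, 254.630005])

# r_t = (P_t - P_{t-1}) / P_{t-1}; the first day has no previous price, so it is dropped
simple_returns = (prices[1:] - prices[:-1]) / prices[:-1]

print(np.round(simple_returns, 4))
```

This is the same calculation that `pct_change()` performs column by column on the full price DataFrame.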
#Compute the daily returns (kept at full precision for the covariance and PCA steps)
returns = data.pct_change().dropna()
#Preview the data, rounded for display only
head_rows = returns.head(5).round(4)
tail_rows = returns.tail(5).round(4)
ellipsis_row = pd.DataFrame([["..."] * returns.shape[1]], columns=returns.columns, index=["..."])
display(pd.concat([head_rows, ellipsis_row, tail_rows]).T)
| 2025-04-02 00:00:00 | 2025-04-03 00:00:00 | 2025-04-04 00:00:00 | 2025-04-07 00:00:00 | 2025-04-08 00:00:00 | ... | 2025-09-24 00:00:00 | 2025-09-25 00:00:00 | 2025-09-26 00:00:00 | 2025-09-29 00:00:00 | 2025-09-30 00:00:00 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker | |||||||||||
| AAPL | 0.0031 | -0.0925 | -0.0729 | -0.0367 | -0.0498 | ... | -0.0083 | 0.0181 | -0.0055 | -0.004 | 0.0008 |
| ACN | 0.0088 | -0.047 | -0.0544 | -0.0012 | -0.0117 | ... | 0.0152 | -0.0273 | 0.0276 | 0.0336 | -0.0016 |
| ADBE | 0.0067 | -0.048 | -0.0495 | -0.024 | -0.0021 | ... | -0.0235 | 0.0025 | 0.0175 | -0.0026 | -0.0186 |
| ADI | 0.0021 | -0.0937 | -0.09 | 0.0409 | -0.0306 | ... | 0.0074 | -0.0043 | 0.0001 | -0.0112 | 0.0037 |
| AMAT | 0.0143 | -0.0828 | -0.0632 | 0.0465 | -0.0293 | ... | 0.0028 | -0.0091 | 0.0216 | 0.0051 | -0.001 |
| AMD | 0.0018 | -0.089 | -0.0857 | -0.0247 | -0.0649 | ... | -0.0001 | 0.0024 | -0.0112 | 0.0119 | 0.0027 |
| ANET | 0.0213 | -0.1109 | -0.0968 | 0.059 | 0.0195 | ... | -0.0101 | 0.0029 | -0.0039 | 0.0061 | 0.0163 |
| APH | 0.0277 | -0.0772 | -0.057 | 0.0306 | -0.0136 | ... | -0.0181 | -0.0065 | 0.0022 | -0.013 | 0.0226 |
| APP | 0.0272 | -0.0978 | -0.1626 | 0.0586 | 0.0132 | ... | -0.0142 | -0.0031 | 0.0468 | 0.0634 | 0.0087 |
| AVGO | 0.0212 | -0.1051 | -0.0501 | 0.0537 | 0.0123 | ... | 0.0011 | -0.0095 | -0.0047 | -0.0198 | 0.0061 |
| CDNS | 0.0238 | -0.0605 | -0.0644 | 0.004 | -0.0093 | ... | -0.0255 | -0.0165 | -0.0027 | -0.0045 | 0.0079 |
| CRM | 0.005 | -0.0601 | -0.0567 | 0.0143 | -0.0009 | ... | 0.0054 | -0.0201 | 0.0103 | 0.0069 | -0.033 |
| CRWD | 0.0251 | -0.0649 | -0.0742 | 0.0085 | 0.0021 | ... | -0.0161 | -0.0068 | 0.0176 | 0.0146 | 0.004 |
| CSCO | 0.0003 | -0.0668 | -0.0483 | -0.0024 | -0.0224 | ... | -0.0033 | 0.0079 | -0.0093 | 0.0074 | 0.0103 |
| IBM | -0.0014 | -0.026 | -0.0658 | -0.0075 | -0.021 | ... | -0.0173 | 0.052 | 0.0102 | -0.0159 | 0.0084 |
| INTC | -0.0032 | 0.0205 | -0.115 | -0.0141 | -0.0736 | ... | 0.0641 | 0.0887 | 0.0444 | -0.0287 | -0.027 |
| INTU | 0.0116 | -0.036 | -0.0618 | -0.0094 | -0.0219 | ... | -0.0063 | -0.003 | 0.0081 | -0.0051 | -0.017 |
| KLAC | 0.0055 | -0.0953 | -0.0713 | 0.0487 | -0.0085 | ... | -0.0024 | -0.009 | 0.0049 | -0.0002 | 0.0136 |
| LRCX | 0.013 | -0.116 | -0.094 | 0.0526 | -0.0314 | ... | -0.0254 | -0.0015 | 0.0016 | 0.0215 | 0.0214 |
| MSFT | -0.0001 | -0.0236 | -0.0356 | -0.0055 | -0.0092 | ... | 0.0018 | -0.0061 | 0.0087 | 0.0061 | 0.0065 |
| MSI | 0.0023 | -0.0034 | -0.0766 | 0.0025 | -0.0208 | ... | -0.0331 | 0.0013 | 0.0017 | -0.0051 | 0.0068 |
| MU | -0.0012 | -0.1609 | -0.1294 | 0.0564 | -0.0414 | ... | -0.0282 | -0.0302 | 0.0028 | 0.0422 | 0.0209 |
| NOW | 0.0154 | -0.0606 | -0.0677 | 0.0192 | -0.0107 | ... | 0.0061 | -0.0158 | 0.0189 | 0.0052 | -0.0219 |
| NVDA | 0.0025 | -0.0781 | -0.0736 | 0.0353 | -0.0137 | ... | -0.0082 | 0.0041 | 0.0028 | 0.0205 | 0.026 |
| ORCL | 0.0276 | -0.0592 | -0.0653 | -0.0087 | -0.0209 | ... | -0.0171 | -0.0555 | -0.027 | -0.0025 | -0.0054 |
| PANW | 0.0109 | -0.0463 | -0.0702 | -0.0074 | 0.0006 | ... | -0.0125 | 0.0075 | 0.0008 | 0.0079 | -0.0017 |
| PLTR | 0.0327 | -0.044 | -0.1147 | 0.0517 | -0.0067 | ... | -0.0164 | -0.0025 | -0.0087 | 0.0073 | 0.0199 |
| QCOM | 0.0067 | -0.0951 | -0.0858 | 0.0177 | -0.039 | ... | 0.0237 | -0.0223 | -0.0028 | -0.023 | 0.0064 |
| SNPS | 0.006 | -0.0474 | -0.0709 | -0.0186 | 0.0018 | ... | -0.0453 | 0.0408 | 0.0011 | -0.0126 | 0.0245 |
| TXN | 0.0011 | -0.0785 | -0.078 | 0.0172 | -0.0519 | ... | 0.0132 | -0.013 | 0.0138 | -0.0072 | 0.0027 |
Covariance Matrix¶
Now, let's compute the covariance matrix from the daily returns of the 30 XLK holdings to measure how their returns move together.
cov_matrix = returns.cov()
print("✅ Covariance matrix (first 10x10):")
display(cov_matrix.iloc[:10, :10])
✅ Covariance matrix (first 10x10):
| Ticker | AAPL | ACN | ADBE | ADI | AMAT | AMD | ANET | APH | APP | AVGO |
|---|---|---|---|---|---|---|---|---|---|---|
| AAPL | 0.000620 | 0.000223 | 0.000258 | 0.000471 | 0.000471 | 0.000584 | 0.000403 | 0.000315 | 0.000494 | 0.000419 |
| ACN | 0.000223 | 0.000352 | 0.000203 | 0.000309 | 0.000288 | 0.000264 | 0.000262 | 0.000155 | 0.000353 | 0.000226 |
| ADBE | 0.000258 | 0.000203 | 0.000331 | 0.000290 | 0.000245 | 0.000282 | 0.000253 | 0.000167 | 0.000299 | 0.000238 |
| ADI | 0.000471 | 0.000309 | 0.000290 | 0.000776 | 0.000664 | 0.000747 | 0.000482 | 0.000406 | 0.000633 | 0.000596 |
| AMAT | 0.000471 | 0.000288 | 0.000245 | 0.000664 | 0.000961 | 0.000757 | 0.000499 | 0.000415 | 0.000624 | 0.000634 |
| AMD | 0.000584 | 0.000264 | 0.000282 | 0.000747 | 0.000757 | 0.001373 | 0.000468 | 0.000474 | 0.000752 | 0.000707 |
| ANET | 0.000403 | 0.000262 | 0.000253 | 0.000482 | 0.000499 | 0.000468 | 0.001239 | 0.000449 | 0.000842 | 0.000681 |
| APH | 0.000315 | 0.000155 | 0.000167 | 0.000406 | 0.000415 | 0.000474 | 0.000449 | 0.000453 | 0.000540 | 0.000454 |
| APP | 0.000494 | 0.000353 | 0.000299 | 0.000633 | 0.000624 | 0.000752 | 0.000842 | 0.000540 | 0.001960 | 0.000767 |
| AVGO | 0.000419 | 0.000226 | 0.000238 | 0.000596 | 0.000634 | 0.000707 | 0.000681 | 0.000454 | 0.000767 | 0.001005 |
The raw numbers are hard to read at a glance, so we visualize the full matrix as a heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
cov_matrix = returns.cov()
plt.figure(figsize=(12, 8))
sns.heatmap(cov_matrix,
            cmap='RdBu_r',
            center=0,
            annot=False,
            square=True,
            cbar_kws={'label': 'Covariance'})
plt.title("Covariance Matrix of XLK 30 Holdings", fontsize=14)
plt.tight_layout()
plt.show()
The heatmap shows how the daily returns of XLK’s 30 technology holdings move together: the darker the red, the larger the covariance between two securities. Most cells show positive covariances (light to dark red), indicating that the majority of stocks tend to rise and fall in tandem. This reflects strong sector-wide co-movement, typical of large-cap technology firms.
A few darker red blocks along the diagonal highlight pairs or subgroups with particularly strong relationships, such as semiconductor stocks (e.g., NVIDIA, AMD, and TXN) and software giants (e.g., Microsoft and Adobe). This clustering suggests the presence of industry-specific factors in addition to the broad market trend. Note that the diagonal cells themselves are each stock's own variance, so the darkest ones mark the most volatile names (APP and AMD among the ten tickers shown in the table above) rather than a relationship between two stocks.
Meanwhile, a handful of light or slightly blue areas represent weak or mildly negative covariances, meaning certain stocks move somewhat independently from the rest, possibly due to differing business models or diversification within the ETF.
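Because covariance scales with each stock's volatility, a dark cell can reflect high variance as much as tight co-movement. A minimal sketch of rescaling a covariance matrix to correlations, assuming a small synthetic stand-in for `returns` (the tickers and simulated numbers below are hypothetical, not the XLK data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's `returns` DataFrame (illustrative data only).
rng = np.random.default_rng(0)
market = rng.normal(0, 0.01, size=125)           # shared "sector" factor
tickers = ["STOCK_A", "STOCK_B", "STOCK_C"]       # hypothetical tickers
vols = [0.005, 0.01, 0.03]                        # idiosyncratic volatilities
demo_returns = pd.DataFrame(
    {t: market + rng.normal(0, v, size=125) for t, v in zip(tickers, vols)}
)

cov = demo_returns.cov()

# corr_ij = cov_ij / (sigma_i * sigma_j), with sigma the square root of the diagonal.
sigma = np.sqrt(np.diag(cov.values))
corr_from_cov = cov / np.outer(sigma, sigma)

# pandas computes the same quantity directly.
corr = demo_returns.corr()
print(corr_from_cov.round(3))
```

In the notebook itself, `returns.corr()` could be passed to the same `sns.heatmap` call to compare co-movement on a common scale from -1 to 1.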
Let's see the PCA result on daily returns:
from sklearn.decomposition import PCA
#Initiate the PCA model and fit the returns
pca = PCA()
pca.fit(returns)
#Result of variance contribution
explained_var = pca.explained_variance_ratio_
print("Explained Variance Ratio for each component:")
for i, var in enumerate(explained_var, start=1):
    print(f"Component {i}: {var:.4f} ({var*100:.2f}%)")
print(f"The first 10 components explain: {explained_var[:10].sum()*100:.2f}% of the total variance")
Explained Variance Ratio for each component:
Component 1: 0.5064 (50.64%)
Component 2: 0.1154 (11.54%)
Component 3: 0.0614 (6.14%)
Component 4: 0.0456 (4.56%)
Component 5: 0.0391 (3.91%)
Component 6: 0.0308 (3.08%)
Component 7: 0.0264 (2.64%)
Component 8: 0.0217 (2.17%)
Component 9: 0.0200 (2.00%)
Component 10: 0.0169 (1.69%)
Component 11: 0.0154 (1.54%)
Component 12: 0.0134 (1.34%)
Component 13: 0.0111 (1.11%)
Component 14: 0.0097 (0.97%)
Component 15: 0.0085 (0.85%)
Component 16: 0.0075 (0.75%)
Component 17: 0.0070 (0.70%)
Component 18: 0.0059 (0.59%)
Component 19: 0.0052 (0.52%)
Component 20: 0.0045 (0.45%)
Component 21: 0.0044 (0.44%)
Component 22: 0.0042 (0.42%)
Component 23: 0.0039 (0.39%)
Component 24: 0.0030 (0.30%)
Component 25: 0.0030 (0.30%)
Component 26: 0.0026 (0.26%)
Component 27: 0.0023 (0.23%)
Component 28: 0.0019 (0.19%)
Component 29: 0.0017 (0.17%)
Component 30: 0.0013 (0.13%)
The first 10 components explain: 88.37% of the total variance
The first component alone explains 50.64% of the total variance, and the first 10 components together explain 88.37%. We can visualize this with a scree plot:
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_var) + 1), explained_var * 100, 'o-', linewidth=2)
plt.title("Scree Plot - Variance Explained by Principal Components", fontsize=12)
plt.xlabel("Principal Component", fontsize=11)
plt.ylabel("Variance Explained (%)", fontsize=11)
plt.xticks(np.arange(1, len(explained_var) + 1))
plt.grid(True, alpha=0.3)
plt.show()
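A scree plot is often read together with the cumulative share of variance. The sketch below copies the rounded ratios from the PCA output above into an array and counts how many leading components are needed to reach a given threshold. The helper name `components_needed` and the thresholds are assumptions chosen for illustration:

```python
import numpy as np

# Explained-variance ratios copied from the PCA output above (rounded to 4 decimals).
explained_var = np.array([
    0.5064, 0.1154, 0.0614, 0.0456, 0.0391, 0.0308, 0.0264, 0.0217, 0.0200, 0.0169,
    0.0154, 0.0134, 0.0111, 0.0097, 0.0085, 0.0075, 0.0070, 0.0059, 0.0052, 0.0045,
    0.0044, 0.0042, 0.0039, 0.0030, 0.0030, 0.0026, 0.0023, 0.0019, 0.0017, 0.0013,
])

cumulative = np.cumsum(explained_var)

def components_needed(threshold):
    """Smallest number of leading components whose cumulative share reaches `threshold`."""
    return int(np.searchsorted(cumulative, threshold) + 1)

for threshold in (0.80, 0.90, 0.95):
    print(f"{threshold:.0%} of variance: {components_needed(threshold)} components")
```

With these rounded ratios, 7 components are enough to pass 80% and 12 to pass 90%. In the notebook, `np.cumsum(pca.explained_variance_ratio_)` would give the same curve without rounding.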
Singular Value Decomposition¶
Alternatively, we can use Singular Value Decomposition (SVD) to decompose the centered return matrix into orthogonal components. This is a numerically stable way to perform PCA directly on the returns matrix, without first forming the covariance matrix.
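The link between the two methods can be written out in a few lines. The notation is assumed here for illustration: $X$ is the centered $T \times N$ return matrix, $C$ its sample covariance matrix, $\sigma_k$ the $k$-th singular value, and $\lambda_k$ the $k$-th eigenvalue of $C$.

```latex
\begin{align}
X &= U \Sigma V^{\top}, \qquad U^{\top} U = I, \quad V^{\top} V = I \\
C &= \frac{1}{T-1} X^{\top} X
   = \frac{1}{T-1} V \Sigma U^{\top} U \Sigma V^{\top}
   = V \left( \frac{\Sigma^{2}}{T-1} \right) V^{\top} \\
\lambda_k &= \frac{\sigma_k^{2}}{T-1}, \qquad
\text{explained variance ratio}_k = \frac{\lambda_k}{\sum_{j} \lambda_j}
\end{align}
```

So the rows of $V^{\top}$ are the principal directions and the squared singular values, divided by $T-1$, are the component variances.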
#Calculate X, U, S, VT matrices
X = returns - returns.mean()
U, S, VT = np.linalg.svd(X, full_matrices=False)
print("Shapes:")
print("U:", U.shape, " | S:", S.shape, " | VT:", VT.shape)
#Compute the Eigenvalues
eigenvalues = (S**2) / (len(X)-1)
print("\n First 5 singular values:")
print(S[:5])
print("\n Corresponding eigenvalues (variance explained):")
print(eigenvalues[:5])
Shapes:
U: (125, 30)  | S: (30,)  | VT: (30, 30)

 First 5 singular values:
[1.30000094 0.62049368 0.45254319 0.38993217 0.36134511]

 Corresponding eigenvalues (variance explained):
[0.01362905 0.00310494 0.00165158 0.00122619 0.00105299]
The return matrix (125 days × 30 stocks) was decomposed into three matrices:
- U (125 × 30) represents the time-series weights of each component.
- S (30,) contains the singular values, showing the strength of each factor.
- Vᵀ (30 × 30) gives the loadings of each stock on the components.
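The first five singular values are [1.30, 0.62, 0.45, 0.39, 0.36], indicating that the first few components dominate the structure. The corresponding eigenvalues [0.0136, 0.0031, 0.0017, 0.0012, 0.0011] are the variances of those components: the squared singular values divided by 124, the number of observations minus one. The first eigenvalue is about 4.4 times the second, which matches the ratio of the first two explained-variance ratios from PCA (50.64% vs. 11.54%), so the SVD reproduces the PCA results.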
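The agreement between the two routes can also be checked numerically. This is a minimal sketch on simulated data with the same shape as the return matrix (125 × 30). The one-factor data-generating process is an assumption made for illustration and is not the XLK data:

```python
import numpy as np

# Synthetic stand-in for the return matrix (125 days x 30 stocks); illustrative only.
rng = np.random.default_rng(42)
n_days, n_stocks = 125, 30
common = rng.normal(0, 0.015, size=(n_days, 1))       # one shared factor
demo = common + rng.normal(0, 0.01, size=(n_days, n_stocks))
X = demo - demo.mean(axis=0)                           # center each column

# Route 1: SVD of the centered data.
U, S, VT = np.linalg.svd(X, full_matrices=False)
eig_svd = S**2 / (n_days - 1)

# Route 2: eigenvalues of the sample covariance matrix, sorted in descending order.
eig_cov = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

# Both routes should agree up to floating-point error.
print("Eigenvalues match:", np.allclose(eig_svd, eig_cov))

# Normalizing the eigenvalues gives the explained-variance ratios that PCA reports.
ratio = eig_svd / eig_svd.sum()
print("First three ratios:", ratio[:3].round(4))

# U * S gives the component time series, identical to projecting X onto the rows of VT.
scores = U * S
print("Scores match projection:", np.allclose(scores, X @ VT.T))
```

Applied to the notebook's `X`, the same normalization `eigenvalues / eigenvalues.sum()` should reproduce `pca.explained_variance_ratio_`.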
Conclusion¶
This project explored the structure and dynamics of financial markets through three complementary analyses:
(1) modeling yield curves, (2) uncovering correlation patterns in bond yields, and (3) identifying common factors in equity returns. Together, these steps demonstrate how mathematical models and statistical tools can transform raw financial data into interpretable insights about market behavior.
In the first part, both the Nelson–Siegel model and a cubic spline were used to fit the Vietnamese Treasury yield curve. Despite limited observations, the Nelson–Siegel model effectively captured the smooth, upward-sloping term structure using only four economically meaningful parameters: level, slope, curvature, and decay. This shows its strength in both interpretability and parsimony.
Next, by analyzing Australian government bond yield changes with Principal Component Analysis (PCA), we revealed that most of the variation (≈98%) is driven by just two or three latent factors. These correspond to well-known fixed-income movements: level shifts, slope changes, and curvature twists, confirming the existence of a low-dimensional structure in bond market dynamics.
Finally, the empirical analysis of the XLK ETF extended the same framework to the equity market. The covariance matrix and PCA results demonstrated that technology stocks exhibit strong co-movement, dominated by a single market-wide component explaining over 50% of total variance, followed by smaller sub-sector and firm-specific effects. Singular Value Decomposition (SVD) confirmed this hierarchy, showing that only a few orthogonal components account for most of the market’s behavior.
Across both fixed income and equity domains, the findings emphasize a consistent principle:
Financial systems, while high-dimensional in appearance, are governed by a small number of underlying factors that capture most of their dynamics.