Web Scraping for Machine Learning with ProxyTee

Web scraping’s importance has surged, and it is now essential across many businesses: it automates data collection, delivers results quickly, keeps costs down, and drives data-based market analysis. But how will machine learning (ML) influence data scraping techniques?
Understanding Machine Learning
Machine learning, a core part of data science, mimics human learning by using algorithms to analyze data. This approach automates processes, requiring minimal manual coding from developers. It applies to many areas such as:
- Customer Service: AI-powered chatbots are taking over customer service, providing instant answers to common queries.
- Web Unblocking: AI- and ML-driven proxy solutions enable smooth data gathering without blocks or errors.
- Computer Vision: Machine learning extracts insights from visual data, enabling recognition tasks like those in self-driving cars.
- Stock Trading: Automated trading powered by algorithms optimizes stock portfolios.
Web Scraping’s Crucial Role in Machine Learning
Web scraping is vital for gathering the high-quality data that machine learning needs. Internal data can be useful, but it is limited; scraping external sources yields more comprehensive data points and, in turn, better results. This is why the demand for more sophisticated data-gathering tools keeps growing.
In this post, we’ll explore how to combine web scraping and machine learning to analyze stock prices, using ProxyTee to make the data-gathering process smoother and quicker.
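Because Requests-HTML’s HTMLSession extends requests.Session, a proxy can be attached directly to the session. Here is a minimal sketch of how a ProxyTee proxy might be wired in; the endpoint and credentials below are placeholders, so substitute your own:
from requests_html import HTMLSession

# Hypothetical ProxyTee gateway; replace with your actual endpoint and credentials
PROXY = 'http://USERNAME:PASSWORD@gateway.proxytee.example:8080'

session = HTMLSession()
session.proxies.update({'http': PROXY, 'https': PROXY})
# Every request made on this session is now routed through the proxy
r = session.get('https://finance.yahoo.com')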
From Web to Model: Data Preparation in Action
1️⃣ Project Setup and Requirements
We will use Python 3.9, along with the following libraries:
- Web scraping: Requests-HTML and BeautifulSoup4
- Machine learning: pandas, NumPy, Matplotlib, seaborn, scikit-learn, and TensorFlow (Keras).
Install the required libraries:
$ python3 -m pip install requests_html beautifulsoup4
$ python3 -m pip install pandas numpy matplotlib seaborn tensorflow scikit-learn keras
2️⃣ Data Extraction and Preparation
We will use Jupyter Notebook to execute and demonstrate code and graphs.
First, import the libraries:
from requests_html import HTMLSession
import pandas as pd
Use Requests-HTML to extract the HTML from the target webpage.
url = 'https://finance.yahoo.com/quote/AAPL/history?p=AAPL&guccounter=1&period1=1556113078&period2=1713965616'
session = HTMLSession()
r = session.get(url)
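Before parsing, it is worth a quick sanity check that the request actually succeeded; a minimal addition, not part of the original flow:
# Abort early if the page did not load correctly (e.g., blocked or rate-limited)
if r.status_code != 200:
    raise RuntimeError(f'Request failed with status {r.status_code}')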
Using XPath, extract the necessary data into a list of dictionaries:
rows = r.html.xpath('//table/tbody/tr')
symbol = 'AAPL'
data = []
for row in rows:
    # Skip rows that lack a full set of columns (e.g., dividend or split rows)
    if len(row.xpath('.//td')) < 7:
        continue
    data.append({
        'Symbol': symbol,
        'Date': row.xpath('.//td[1]/text()')[0],
        'Open': row.xpath('.//td[2]/text()')[0],
        'High': row.xpath('.//td[3]/text()')[0],
        'Low': row.xpath('.//td[4]/text()')[0],
        'Close': row.xpath('.//td[5]/text()')[0],
        'Adj Close': row.xpath('.//td[6]/text()')[0],
        'Volume': row.xpath('.//td[7]/text()')[0]
    })
Finally, convert this list of dictionaries to a Pandas DataFrame to store the collected data:
df = pd.DataFrame(data)
3️⃣ Cleaning Data for Machine Learning
Data cleaning is necessary before training the model.
- First, convert the ‘Date’ column to DateTime format.
- Then, strip the thousands-separator commas and convert the numeric columns from strings to float.
- Drop missing values.
- Set date as the index column of DataFrame.
df['Date'] = pd.to_datetime(df['Date'])
str_cols = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']  # include 'Open' so it can be scaled later
df[str_cols] = df[str_cols].replace(',', '', regex=True).astype(float)
df.dropna(inplace=True)
df = df.set_index('Date')
df.head()
4️⃣ Data Visualization
To see a trend in the price data, let’s first visualize the adjusted closing stock price of Apple.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
plt.style.use('ggplot')
plt.figure(figsize=(15, 6))
df['Adj Close'].plot()
plt.ylabel('Adj Close')
plt.xlabel(None)
plt.title('Adjusted Closing Price of AAPL')
plt.show()
5️⃣ Preparing Data for Machine Learning Model
We choose to use ‘Open’, ‘High’, ‘Low’, ‘Volume’ as the training set features and ‘Adj Close’ as the variable to be predicted.
features = ['Open', 'High', 'Low', 'Volume']
y = df.filter(['Adj Close'])
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(df[features])
Split the data into training and testing sets using TimeSeriesSplit (the loop below ends with the final, largest split), then reshape the features into the (samples, timesteps, features) layout that LSTM layers expect.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=10)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# Reshape to (samples, timesteps=1, features) for the LSTM
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
6️⃣ Training the Model
Build a Sequential model with an LSTM layer followed by a Dense output layer:
from keras.models import Sequential
from keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(32, activation='relu', return_sequences=False))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=100, batch_size=8)
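Training can take a while, so it may be worth persisting the fitted model for later reuse; a small sketch using Keras’ standard save/load API (the filename is illustrative):
model.save('aapl_lstm.keras')  # write the trained model to disk

# Later, reload it without retraining
from keras.models import load_model
model = load_model('aapl_lstm.keras')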
With training complete, let’s generate predictions on the test set and compare them against the actual values.
y_pred = model.predict(X_test)
plt.figure(figsize=(15, 6))
plt.plot(y_test.values, label='Actual Value')
plt.plot(y_pred, label='Predicted Value')
plt.ylabel('Adjusted Close')
plt.xlabel('Time Scale')
plt.legend()
plt.show()
The predicted values generally track the actual stock price trend.
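To put a number on that visual impression, standard regression metrics can be computed on the test split; a minimal sketch with scikit-learn, assuming y_test and y_pred from above:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae = mean_absolute_error(y_test, y_pred)            # average absolute error, in price units
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # penalizes large misses more heavily
print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f}')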
7️⃣ Full Code:
from requests_html import HTMLSession
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit
from keras.models import Sequential
from keras.layers import LSTM, Dense
url = 'https://finance.yahoo.com/quote/AAPL/history?p=AAPL&guccounter=1&period1=1556113078&period2=1713965616'
session = HTMLSession()
r = session.get(url)
rows = r.html.xpath('//table/tbody/tr')
symbol = 'AAPL'
data = []
for row in rows:
    # Skip rows that lack a full set of columns (e.g., dividend or split rows)
    if len(row.xpath('.//td')) < 7:
        continue
    data.append({
        'Symbol': symbol,
        'Date': row.xpath('.//td[1]/text()')[0],
        'Open': row.xpath('.//td[2]/text()')[0],
        'High': row.xpath('.//td[3]/text()')[0],
        'Low': row.xpath('.//td[4]/text()')[0],
        'Close': row.xpath('.//td[5]/text()')[0],
        'Adj Close': row.xpath('.//td[6]/text()')[0],
        'Volume': row.xpath('.//td[7]/text()')[0]
    })
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
str_cols = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
df[str_cols] = df[str_cols].replace(',', '', regex=True).astype(float)
df.dropna(inplace=True)
df = df.set_index('Date')
df.head()
sns.set_style('darkgrid')
plt.style.use('ggplot')
plt.figure(figsize=(15, 6))
df['Adj Close'].plot()
plt.ylabel('Adj Close')
plt.xlabel(None)
plt.title('Adjusted Closing Price of AAPL')
plt.show()
features = ['Open', 'High', 'Low', 'Volume']
y = df.filter(['Adj Close'])
scaler = MinMaxScaler()
X = scaler.fit_transform(df[features])
tscv = TimeSeriesSplit(n_splits=10)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
model = Sequential()
model.add(LSTM(32, activation='relu', return_sequences=False))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=100, batch_size=8)
y_pred = model.predict(X_test)
plt.figure(figsize=(15, 6))
plt.plot(y_test.values, label='Actual Value')
plt.plot(y_pred, label='Predicted Value')
plt.ylabel('Adjusted Close')
plt.xlabel('Time Scale')
plt.legend()
plt.show()
Conclusion
This post demonstrated how ProxyTee can be used together with web scraping and machine learning to forecast stock prices. With ProxyTee, you can count on consistent access and performance throughout this process.
Our Unlimited Residential Proxies offer rotating residential IPs, unlimited bandwidth, and global coverage, all crucial components of a smooth and reliable web scraping experience. Auto-rotation helps avoid detection and blocks when gathering data from different websites, and the simple API makes it easy to integrate these services into different applications and workflows.
For those looking to scrape web data efficiently, ProxyTee provides robust solutions. From affordable pricing to a range of Residential Proxy and Datacenter Proxy options, we can accommodate a wide variety of requirements and tasks.